What Media Foundation Is - Why It Starts to Feel Like COM and Windows Media APIs at the Same Time
When you first start touching Media Foundation, it often feels like this:
“I thought I was using a Windows video / audio API, but suddenly I am drowning in COM.”
That reaction is natural.
Once CoInitializeEx, MFStartup, IMFSourceReader, IMFMediaType, IMFTransform, IMFActivate, HRESULT, and GUID-heavy configuration all show up at once, the whole thing starts to feel much more like Win32 / COM than a simple media library.
This article does not try to cover the entire Media Foundation API like a dictionary. Instead, it organizes three practical questions first:
- why COM concepts naturally appear when you use Media Foundation
- where the COM flavor becomes especially strong
- where to enter first among Source Reader / Sink Writer / Media Session / MFT
The code examples are C++-based, but the way of thinking is almost the same even when you reach Media Foundation from .NET through a wrapper.
Contents
- Short version
- The first orientation table
- The overall shape of Media Foundation
- The places where Media Foundation starts looking like COM
  - 4.1. CoInitializeEx and MFStartup appear next to each other at initialization
  - 4.2. Object boundaries are mostly interface-based
  - 4.3. Activation objects appear
  - 4.4. Configuration and media type information revolve around IMFAttributes and GUIDs
  - 4.5. Async, callbacks, and threading are also handled in a COM-like way
  - 4.6. But Media Foundation is not “just COM”
- Rough rule-of-thumb usage
- A practical checklist
- Code excerpts
- Summary
- References
1. Short version
- Media Foundation is a platform for handling audio and video. The whole API surface is not simply pure COM.
- But the boundaries between source / transform / sink / activation / attributes / callbacks are expressed through COM interfaces, so once you use it, IUnknown, HRESULT, GUIDs, and apartment/threading discussions naturally appear.
- It is usually easiest to start with Source Reader / Sink Writer first, move to Media Session when you need playback control, and move into MFT when you need your own transform component.
In other words, Media Foundation is a media-processing platform whose boundaries are deeply shaped by COM.
Once you understand that first, it becomes much easier to see why the API suddenly starts wearing a COM face.
2. The first orientation table
2.1. What to touch first depending on your goal
This first table makes it easier to choose an entry point.
| What you want to do | First thing to touch | How strong the COM feel is | Notes |
|---|---|---|---|
| Pull frames / samples from a file or camera | Source Reader | Medium | It can also take care of decoder loading when needed |
| Write generated audio / video to a file | Sink Writer | Medium | It can manage encoders and media sinks together |
| Handle playback, pause, seek, A/V sync, and quality control | Media Session | High | You need to understand topology and session concepts |
| Insert your own transform or codec-like processing component | MFT | High | The core interface becomes IMFTransform |
| Enumerate candidates first and instantiate only the one you need | IMFActivate | High | The thing you get back may not be the real object yet |
2.2. Where the COM face appears
| Place | What shows up | What to understand first |
|---|---|---|
| Initialization | CoInitializeEx, MFStartup | COM initialization and Media Foundation initialization are separate |
| Object creation / transfer | IMFSourceReader, IMFMediaType, IMFTransform | Most things are interface pointers plus HRESULT |
| Configuration | IMFAttributes, GUIDs | Settings and media type information are represented as key/value data plus GUIDs |
| Enumeration / delayed instantiation | IMFActivate, ActivateObject | Enumeration results are not always the final object |
| Async flow | IMFSourceReaderCallback, work queues | You need to think about callbacks and apartments |
| Playback control | topology, Media Session | The flow of the whole pipeline is a Media Foundation-specific concept |
2.3. Terms worth understanding first
| Term | Meaning here |
|---|---|
| Media Source | The entry point where media data enters the pipeline. This can be a file, network source, capture device, and so on |
| MFT | Media Foundation Transform. The common model for decoders, encoders, and video/audio processing blocks |
| Media Sink | The destination for media data, such as playback, output, or file writing |
| Media Session | The component that manages the pipeline as a whole, especially playback and synchronization |
| Topology | A graph that describes how source / transform / sink nodes are connected |
| Activation Object | A helper object used to instantiate the real object later. In Media Foundation this is usually represented by IMFActivate |
| Attributes | A key/value store based on GUID keys. Media Foundation uses this everywhere |
If you keep these terms in mind first, the documentation becomes much easier to read.
3. The overall shape of Media Foundation
At a high level, Media Foundation is about a media pipeline. COM matters, but it is often easier to understand the overall shape first.
flowchart TB
subgraph Pipeline["Using the whole pipeline"]
Source1["Media Source"] --> Transform1["MFT"]
Transform1 --> Sink1["Media Sink"]
Session["Media Session"] --- Source1
Session --- Transform1
Session --- Sink1
end
subgraph Direct["Application handles data more directly"]
Source2["Media Source"] --> Reader["Source Reader (+ decoder)"]
App["Application"] --> Writer["Sink Writer (+ encoder)"]
Writer --> Sink2["Media Sink"]
end
There are roughly two ways to use Media Foundation:
- Use the pipeline as a whole
  - connect source / transform / sink and let Media Session manage data flow and A/V synchronization
- Handle the data more directly in the application
  - pull data from a source with Source Reader and push it toward an output with Sink Writer
The latter is often easier when you want to process frames or samples yourself. The former is the more natural path if you want the platform to handle playback and synchronization.
The important point is that Media Foundation is a media-processing platform, not simply a bag of COM objects you poke directly.
But once you start looking at the boundaries between those components, the COM face becomes much stronger. That is the next section.
4. The places where Media Foundation starts looking like COM
4.1. CoInitializeEx and MFStartup appear next to each other at initialization
This is one of the first things that feels strange.
Before you can “just open a file” or “just pull data from a camera,” you see both CoInitializeEx and MFStartup.
- CoInitializeEx initializes COM
- MFStartup initializes the Media Foundation platform
So COM initialization alone is not enough. That is often the first moment where it becomes obvious that this is not merely “a video API” but a platform with a significant COM-based contract underneath.
In practical code, it helps to decide these things early:
- which thread will use Media Foundation
- whether that thread will be STA or MTA
- who owns MFStartup / MFShutdown and CoInitializeEx / CoUninitialize
If you keep that vague, callback and UI-integration problems become much harder to understand later.
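One way to make that ownership explicit is a pair of scope guards whose nesting fixes the shutdown order. The sketch below is portable C++ that models only the ordering. `ScopeGuard`, `RunMediaWork`, and the logged step names are hypothetical stand-ins; in real code the callables would wrap CoInitializeEx / CoUninitialize and MFStartup / MFShutdown and would check each HRESULT.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs `enter` on construction and `leave` on destruction.
// In real code `enter` would wrap CoInitializeEx or MFStartup.
class ScopeGuard {
public:
    ScopeGuard(std::function<void()> enter, std::function<void()> leave)
        : leave_(std::move(leave)) {
        enter();
    }
    ~ScopeGuard() { leave_(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> leave_;
};

// Models one thread's lifetime: COM first, then the Media Foundation platform.
// Destructors run in reverse declaration order, so the platform shuts down
// before COM is uninitialized.
std::vector<std::string> RunMediaWork() {
    std::vector<std::string> log;
    {
        ScopeGuard com([&] { log.push_back("com-init"); },
                       [&] { log.push_back("com-uninit"); });
        ScopeGuard mf([&] { log.push_back("mf-startup"); },
                      [&] { log.push_back("mf-shutdown"); });
        log.push_back("work");
    }
    return log;
}
```

Because the guards live in one scope on one thread, the question "who owns startup and shutdown" has a single visible answer in the code.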
4.2. Object boundaries are mostly interface-based
Once you start reading Media Foundation APIs, many return values and out parameters are COM interfaces:
- IMFSourceReader
- IMFMediaType
- IMFTransform
- IMFActivate
- IMFSample
- IMFMediaBuffer
What matters here is that not only the media data itself, but also type descriptions and configuration objects are represented as interfaces.
For example:
- IMFTransform represents an MFT
- IMFAttributes is a key/value store
- IMFMediaType is essentially a media-format description built on top of IMFAttributes
So even something that looks like “configuration data” is still shaped as a COM interface.
That is why IUnknown, QueryInterface, AddRef / Release, and HRESULT naturally enter the picture.
flowchart TD
IUnknown["IUnknown"]
IUnknown --> IMFAttributes["IMFAttributes"]
IMFAttributes --> IMFMediaType["IMFMediaType"]
IMFAttributes --> IMFActivate["IMFActivate"]
IUnknown --> IMFSourceReader["IMFSourceReader"]
IUnknown --> IMFTransform["IMFTransform"]
At this point it becomes much easier to say:
“Media Foundation is a media API, but its boundaries are very COM-shaped.”
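To make the reference-counting and interface-query contract concrete, here is a minimal portable model of it. Everything in it (`Unknown`, `Attributes`, `MediaType`, `VideoType`, string-based interface names) is a hypothetical stand-in: real code uses IUnknown, IID values, and HRESULT from the Windows SDK. The sketch only mirrors the shape of the rules.

```cpp
#include <atomic>
#include <string_view>

// Stand-in for IUnknown: lifetime and interface discovery live on the object.
struct Unknown {
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    // Returns true and an AddRef'ed pointer when the interface is supported.
    virtual bool QueryInterface(std::string_view name, void** out) = 0;
    virtual ~Unknown() = default;
};

// Mirrors the inheritance chain from the text: a media type is an attribute store.
struct Attributes : Unknown {
    virtual int Count() const = 0;
};

struct MediaType : Attributes {
    virtual bool IsVideo() const = 0;
};

class VideoType final : public MediaType {
public:
    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override {
        unsigned long left = --refs_;
        if (left == 0) delete this;  // the last reference destroys the object
        return left;
    }
    bool QueryInterface(std::string_view name, void** out) override {
        if (name == "Unknown") {
            *out = static_cast<Unknown*>(this);
        } else if (name == "Attributes") {
            *out = static_cast<Attributes*>(this);
        } else if (name == "MediaType") {
            *out = static_cast<MediaType*>(this);
        } else {
            *out = nullptr;
            return false;
        }
        AddRef();  // a successful query hands out a new reference
        return true;
    }
    int Count() const override { return 2; }
    bool IsVideo() const override { return true; }

private:
    std::atomic<unsigned long> refs_{1};  // the creator holds the first reference
};

// Walks the steps a caller performs on a real interface pointer and returns
// the count that remains after the queried reference is released.
unsigned long DemoRefCounting() {
    MediaType* type = new VideoType();                  // count is 1
    void* raw = nullptr;
    unsigned long remaining = 0;
    if (type->QueryInterface("Attributes", &raw)) {     // count is 2
        Attributes* attrs = static_cast<Attributes*>(raw);
        remaining = attrs->Release();                   // count is back to 1
    }
    type->Release();                                    // count is 0, object is gone
    return remaining;
}
```

In real projects a smart pointer such as Microsoft::WRL::ComPtr usually takes over the AddRef / Release bookkeeping shown here by hand.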
4.3. Activation objects appear
One of the places where the COM flavor becomes especially visible is the activation-object model.
IMFActivate is a helper object used to create the real object later.
The easiest mental model is that it feels somewhat like a COM class factory.
This shows up in places where enumeration APIs do not immediately return a directly usable transform or sink object.
Instead, they first return an array of IMFActivate*.
Then you inspect the candidate, and only later call ActivateObject to create the real instance.
sequenceDiagram
participant App as App
participant Enum as Enumeration API
participant Act as IMFActivate
participant Obj as IMFTransform / Sink / etc.
App->>Enum: enumerate candidates
Enum-->>App: array of IMFActivate*
App->>Act: inspect attributes
App->>Act: ActivateObject(...)
Act-->>App: real COM object
This structure fits well with how Media Foundation wants to:
- discover interchangeable building blocks
- inspect them
- and instantiate only what is actually needed
Since activation objects can also carry attributes, the flow naturally becomes:
- inspect candidate metadata
- possibly configure it
- instantiate later
That is a very COM-like shape.
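The enumerate, inspect, activate flow can be sketched without any Windows headers. The model below is an assumption-laden illustration, not the Media Foundation API: `Activation` stands in for IMFActivate, `Transform` for the real object, and the string attributes and decoder names are invented for the example.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a transform; real code would hold an IMFTransform.
struct Transform {
    std::string name;
};

// Hypothetical stand-in for IMFActivate: metadata now, the real object later.
class Activation {
public:
    Activation(std::map<std::string, std::string> attrs,
               std::function<std::unique_ptr<Transform>()> factory)
        : attrs_(std::move(attrs)), factory_(std::move(factory)) {}

    // Cheap inspection: no transform is created here.
    std::string Get(const std::string& key) const {
        auto it = attrs_.find(key);
        return it == attrs_.end() ? std::string() : it->second;
    }

    // Creates the real object on demand, mirroring the role of ActivateObject.
    std::unique_ptr<Transform> Activate() const { return factory_(); }

private:
    std::map<std::string, std::string> attrs_;
    std::function<std::unique_ptr<Transform>()> factory_;
};

// Enumerates candidates; `created` counts how many real objects get built.
// The caller must keep `created` alive while the candidates are in use.
std::vector<Activation> EnumerateDecoders(int& created) {
    auto make = [&created](std::string name) {
        return [&created, name] {
            ++created;
            return std::make_unique<Transform>(Transform{name});
        };
    };
    return {
        Activation({{"name", "SoftwareDecoder"}, {"hardware", "no"}},
                   make("SoftwareDecoder")),
        Activation({{"name", "HardwareDecoder"}, {"hardware", "yes"}},
                   make("HardwareDecoder")),
    };
}

// Inspects metadata first and activates only the first matching candidate.
std::unique_ptr<Transform> PickHardwareDecoder(
    const std::vector<Activation>& candidates) {
    for (const Activation& candidate : candidates) {
        if (candidate.Get("hardware") == "yes") {
            return candidate.Activate();
        }
    }
    return nullptr;
}
```

The point of the counter is that enumeration alone builds nothing: the expensive object exists only after one candidate has been chosen.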
4.4. Configuration and media type information revolve around IMFAttributes and GUIDs
Another moment when Media Foundation suddenly feels “very COM” is when settings turn into a forest of GUIDs.
The center of that forest is IMFAttributes.
IMFAttributes is a GUID-keyed key/value store, and Media Foundation uses it constantly.
The especially important thing is IMFMediaType.
IMFMediaType inherits from IMFAttributes and stores media-format information as attributes.
Typical examples include:
- major type (audio or video)
- subtype (H.264, AAC, RGB32, PCM, and so on)
- frame size
- frame rate
- sample rate
- channel count
flowchart LR
MediaType["IMFMediaType"] --> Major["MF_MT_MAJOR_TYPE"]
MediaType --> Subtype["MF_MT_SUBTYPE"]
MediaType --> Detail["frame size / FPS / sample rate / etc."]
It is easy to experience this as “the forest of GUIDs.” But the underlying design is actually straightforward:
- use an attribute store to hold configuration
- represent media types through the same attribute-store model
- negotiate formats between source / transform / sink by reading and matching those attributes
So the real point is:
Media Foundation uses COM-like interfaces and GUIDs to represent configuration and media-type metadata.
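A small portable model shows why this design is less mysterious than it looks. The sketch below assumes a simplified attribute store with only two value kinds; `Guid`, `AttributeStore`, `Matches`, and the `k...` keys are hypothetical and their values are not the real SDK GUIDs. Packing width into the high 32 bits follows the layout that frame-size attributes use, as far as this sketch is concerned.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <optional>
#include <variant>

// Hypothetical 128-bit key standing in for a GUID such as MF_MT_MAJOR_TYPE.
using Guid = std::array<std::uint8_t, 16>;

inline Guid MakeGuid(std::uint8_t tag) {
    Guid g{};
    g[0] = tag;
    return g;
}

// An attribute value is either another GUID or a 64-bit integer in this model.
using Value = std::variant<Guid, std::uint64_t>;

// Stand-in for IMFAttributes: a key/value store keyed by GUID.
class AttributeStore {
public:
    void Set(const Guid& key, Value value) { items_[key] = value; }
    std::optional<Value> Get(const Guid& key) const {
        auto it = items_.find(key);
        if (it == items_.end()) return std::nullopt;
        return it->second;
    }
    const std::map<Guid, Value>& Items() const { return items_; }

private:
    std::map<Guid, Value> items_;
};

// Illustrative keys and values only.
const Guid kMajorType = MakeGuid(1);
const Guid kSubtype = MakeGuid(2);
const Guid kFrameSize = MakeGuid(3);
const Guid kVideo = MakeGuid(10);
const Guid kH264 = MakeGuid(20);
const Guid kRgb32 = MakeGuid(21);

// Packs width and height into one 64-bit value (width in the high 32 bits).
std::uint64_t PackSize(std::uint32_t width, std::uint32_t height) {
    return (static_cast<std::uint64_t>(width) << 32) | height;
}

// A partial type matches an offered type when every attribute it sets is
// present in the offer with an equal value.
bool Matches(const AttributeStore& required, const AttributeStore& offered) {
    for (const auto& [key, value] : required.Items()) {
        std::optional<Value> other = offered.Get(key);
        if (!other || *other != value) return false;
    }
    return true;
}
```

Seen this way, format negotiation is attribute comparison: a component states the keys it cares about, and a candidate type either agrees on all of them or is rejected.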
4.5. Async, callbacks, and threading are also handled in a COM-like way
Another place that is easy to underestimate in practice is asynchronous behavior and thread model.
For example, Source Reader is synchronous by default.
In synchronous mode, ReadSample blocks.
Depending on the file, network, or device, that blocking time can become very visible.
If you want asynchronous mode, you create the Source Reader with a callback. That means:
- implement an IMFSourceReaderCallback
- put that object into the MF_SOURCE_READER_ASYNC_CALLBACK attribute
- then create the Source Reader
There is also a threading point that matters a lot: Media Foundation asynchronous processing uses a work queue, and the Media Foundation work-queue threads are MTA.
That means it is often simpler to keep the application side in MTA as well.
sequenceDiagram
participant App as App thread
participant Reader as Source Reader
participant Queue as MF work queue (MTA)
participant Cb as IMFSourceReaderCallback
App->>Reader: ReadSample(...)
Reader-->>App: returns immediately
Reader->>Queue: internal processing
Queue->>Cb: OnReadSample(...)
The practical points are:
- do not touch STA UI objects directly inside the callback
- make the callback implementation thread-safe
- if UI updates are needed, return only the result to the UI thread
- decide early which threads are expected to receive Media Foundation callbacks
Media Foundation does not magically absorb your STA assumptions for you. So in many applications it is simpler to:
keep Media Foundation worker code on the MTA side, and build an explicit bridge back to the UI.
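The bridge can be as small as a thread-safe queue. The sketch below is a portable model under stated assumptions: a std::thread stands in for a Media Foundation work-queue thread, the calling thread stands in for the UI thread, and `ResultQueue` and `PumpFrames` are hypothetical names. The callback side only enqueues a result and never touches UI state.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe hand-off queue: the callback side pushes, the UI side pops.
class ResultQueue {
public:
    void Push(int frameIndex) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(frameIndex);
        }
        ready_.notify_one();
    }

    // Blocks until an item is available.
    int Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> items_;
};

// The worker plays the role of the callback thread and only enqueues.
// The calling thread plays the role of the UI thread and applies results.
std::vector<int> PumpFrames(int frameCount) {
    ResultQueue queue;
    std::thread worker([&queue, frameCount] {
        for (int i = 0; i < frameCount; ++i) {
            queue.Push(i);  // models the body of OnReadSample: no UI access
        }
    });

    std::vector<int> shown;
    for (int i = 0; i < frameCount; ++i) {
        shown.push_back(queue.Pop());  // the "UI" thread consumes the result
    }
    worker.join();
    return shown;
}
```

A real UI thread would not block in `Pop`; it would more likely be woken by a posted window message or a dispatcher call and then drain the queue. The blocking loop here just keeps the model short.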
4.6. But Media Foundation is not “just COM”
At this point, it is easy to think:
“So Media Foundation is basically COM.”
That is not quite right.
Media Foundation has concepts that go beyond generic COM:
- MFStartup / MFShutdown
- Media Session
- topology
- topology loader
- presentation clock
- Source Reader / Sink Writer
These are not merely COM concepts. They are Media Foundation’s own platform-level model for handling media pipelines.
For example, in Media Session, the application can provide a partial topology, and the topology loader can resolve it into a full topology by supplying the transforms that are needed. That is not just “COM in general.” It is Media Foundation behaving as a media-processing platform.
flowchart LR
Partial["Partial Topology<br/>Source -> Output"] --> Loader["Topology Loader"]
Loader --> Full["Full Topology<br/>Source -> Decoder MFT -> Output"]
So the easiest way to understand it is as a two-layer picture:
- Media Foundation uses COM to express component contracts
- on top of that, it acts as a media-processing platform with its own pipeline concepts
That framing makes the API much easier to digest.
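As a rough illustration of the partial-to-full resolution idea, here is a toy resolver. It is a deliberately simplified assumption: `Registry`, `ResolveTopology`, the format strings, and the single-transform rule are invented for this sketch, and the real topology loader does considerably more.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry: (input format, output format) -> transform name.
using Registry = std::map<std::pair<std::string, std::string>, std::string>;

// Resolves "source -> output" into a full node list. When the formats differ,
// one transform from the registry is inserted between them. If no transform
// is registered for the pair, resolution fails.
std::optional<std::vector<std::string>> ResolveTopology(
    const std::string& sourceFormat,
    const std::string& outputFormat,
    const Registry& registry) {
    std::vector<std::string> nodes{"Source"};
    if (sourceFormat != outputFormat) {
        auto it = registry.find(std::make_pair(sourceFormat, outputFormat));
        if (it == registry.end()) return std::nullopt;
        nodes.push_back(it->second);
    }
    nodes.push_back("Output");
    return nodes;
}
```

The application describes only the endpoints; the platform-level step is deciding which transform, if any, has to sit between them.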
5. Rough rule-of-thumb usage
Surprisingly often, this first branching decision is all you need:
flowchart TD
Start["What do you want to do?"] --> Q1{"What is the first thing you need?"}
Q1 -- "I want to read frames / samples" --> A1["Source Reader"]
Q1 -- "I want to write media to a file" --> A2["Sink Writer"]
Q1 -- "I need playback control and A/V sync" --> A3["Media Session"]
Q1 -- "I want to insert a custom transform" --> A4["MFT"]
5.1. Start with Source Reader when you want to pull samples
Source Reader is a very approachable entry point when you want to pull media data out of a file or device.
Good fits include:
- extracting frames from a video file
- decoding audio samples
- pulling frames from a camera
- feeding Media Foundation sources into your own processing pipeline
Source Reader can take care of decoder loading when needed. But it does not manage presentation clocks, A/V synchronization, or actual playback for you.
It is easiest to think of it as:
an entry point for reading data, not an entry point for full playback.
5.2. Use Sink Writer when you want to write files
Sink Writer is the output-side counterpart.
Good fits include:
- writing generated frames into a video file
- encoding audio samples
- converting one format into another and storing the result
Sink Writer can also bring encoders and media sinks into the flow when necessary.
5.3. Use Media Session when playback and synchronization matter
If your real goal is not just “pull samples” but “play media properly,” then Media Session is usually the right center.
It is a good fit when you need:
- play / pause / seek
- audio/video synchronization
- quality control across the pipeline
- topology-based connection of sources / transforms / sinks
At this level, you are much closer to Media Foundation itself than to the simpler Source Reader / Sink Writer helpers.
5.4. Use MFT when you need to insert your own components
MFT is the common transform model in Media Foundation.
You enter this world when you want to:
- build your own decoder or encoder
- insert a custom audio / video processing block
- enumerate and choose transform components yourself
- control the pipeline more explicitly than the automatic route
The MFT world brings IMFTransform, IMFActivate, media-type negotiation, and sample / buffer management much closer to the surface.
That is why it is usually clearer not to start there first unless you genuinely need it.
6. A practical checklist
| Item | What to decide / inspect | What tends to go wrong if you skip it |
|---|---|---|
| initialization ownership | decide where CoInitializeEx and MFStartup are called and where their shutdown counterparts live | missing initialization, confusing shutdown order |
| apartment model | decide upfront whether the relevant threads are STA or MTA | callback confusion, UI collisions |
| Source Reader mode | decide synchronous vs asynchronous mode at creation time | unexpected blocking or impossible late-mode changes |
| media-type negotiation | explicitly enumerate and choose actual output formats | MF_E_INVALIDMEDIATYPE, unexpected format mismatches |
| object lifetime | make Release, Unlock, and shutdown responsibility explicit | leaks, stuck buffers, shutdown inconsistencies |
| activation object handling | distinguish between a real object and an IMFActivate wrapper | failing because you assumed the candidate was already the final object |
| topology understanding | know whether you are looking at a partial topology or a resolved full topology | getting stuck while expecting auto-connection to happen magically |
| error checking | inspect HRESULT, stream flags, and events consistently | partial failures get missed |
| UI integration | do not let callbacks touch the UI directly; marshal the result back explicitly | hangs, races, and confusing cross-thread failures |
The three priorities that matter most are:
- choose the right entry API first
- decide the apartment model before everything grows around it
- do not treat media-type negotiation casually
7. Code excerpts
These are not meant to be complete samples.
They are kept short, just long enough to show where Media Foundation starts feeling like COM.
7.1. Initialization
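// Headers used by the excerpts in this section.
// Link with mfplat.lib, mfreadwrite.lib, mfuuid.lib, and ole32.lib.
#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>

// Releases a COM pointer if it is set and clears it.
template <class T>
void SafeRelease(T** pp)
{
    if (pp != nullptr && *pp != nullptr)
    {
        (*pp)->Release();
        *pp = nullptr;
    }
}

HRESULT InitializeMediaFoundationForCurrentThread()
{
    // COM first: this thread joins the multithreaded apartment (MTA).
    HRESULT hr = CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    if (FAILED(hr))
    {
        return hr;
    }

    // Then the Media Foundation platform itself.
    hr = MFStartup(MF_VERSION);
    if (FAILED(hr))
    {
        // Undo the COM initialization so the calls stay balanced.
        CoUninitialize();
        return hr;
    }
    return S_OK;
}

void UninitializeMediaFoundationForCurrentThread()
{
    // Reverse order of initialization.
    MFShutdown();
    CoUninitialize();
}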
This is one of the first points where the COM shape becomes visible:
- CoInitializeEx
- MFStartup
If some other layer already owns COM initialization, that is fine too. The important thing is to make the responsibility explicit.
7.2. Creating a Source Reader in synchronous mode
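HRESULT ReadOneVideoSample(PCWSTR path)
{
    IMFSourceReader* pReader = nullptr;
    IMFMediaType* pType = nullptr;
    IMFSample* pSample = nullptr;
    // Declared before the first goto: C++ does not allow a jump to skip
    // an initialized declaration.
    DWORD streamFlags = 0;
    LONGLONG timestamp = 0;

    HRESULT hr = MFCreateSourceReaderFromURL(path, nullptr, &pReader);
    if (FAILED(hr)) goto done;

    // Describe the output format we want: uncompressed RGB32 video.
    hr = MFCreateMediaType(&pType);
    if (FAILED(hr)) goto done;
    hr = pType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    if (FAILED(hr)) goto done;
    hr = pType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    if (FAILED(hr)) goto done;

    // Many decoders produce YUV output, so this request may fail unless the
    // reader was created with MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING enabled.
    hr = pReader->SetCurrentMediaType(
        MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        nullptr,
        pType);
    if (FAILED(hr)) goto done;

    // Synchronous mode: this call blocks until a sample or an event is available.
    hr = pReader->ReadSample(
        MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        0,
        nullptr,
        &streamFlags,
        &timestamp,
        &pSample);
    if (FAILED(hr)) goto done;

    // On success pSample can still be null (end of stream or a gap),
    // so check it and streamFlags before use.
    // Process IMFMediaBuffer extracted from pSample here

done:
    SafeRelease(&pSample);
    SafeRelease(&pType);
    SafeRelease(&pReader);
    return hr;
}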
What this example makes visible is:
- the reader and media type are COM interfaces
- configuration is GUID-based
- results are carried through HRESULT
- synchronous mode means ReadSample blocks
Even “I just want one frame” already looks very COM-shaped at the Media Foundation boundary.
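When a single read grows into a loop, the order in which results are checked starts to matter. The sketch below models only that decision logic in portable C++. `ReadResult`, `NextStep`, `Decide`, and `CountProcessed` are hypothetical names: in real code the three booleans correspond to FAILED(hr), the end-of-stream stream flag, and whether the returned sample pointer is non-null.

```cpp
#include <vector>

// Hypothetical summary of what one synchronous read reported.
struct ReadResult {
    bool failed;       // models FAILED(hr)
    bool endOfStream;  // models the end-of-stream stream flag
    bool hasSample;    // models pSample != nullptr
};

enum class NextStep { Stop, Skip, Process };

// Order matters: check the error first, then end of stream, then the sample.
NextStep Decide(const ReadResult& r) {
    if (r.failed) return NextStep::Stop;
    if (r.endOfStream) return NextStep::Stop;
    if (!r.hasSample) return NextStep::Skip;  // success without data, e.g. a gap
    return NextStep::Process;
}

// Runs the decision over a scripted sequence of reads and counts how many
// samples would have been processed before the loop stopped.
int CountProcessed(const std::vector<ReadResult>& reads) {
    int processed = 0;
    for (const ReadResult& r : reads) {
        NextStep step = Decide(r);
        if (step == NextStep::Stop) break;
        if (step == NextStep::Process) ++processed;
    }
    return processed;
}
```

The case worth remembering is the third one: a successful call that carries no sample is not an error, and treating it as one ends the loop too early.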
7.3. Creating a Source Reader in asynchronous mode
HRESULT CreateSourceReaderAsync(
    PCWSTR path,
    IMFSourceReaderCallback* pCallback,
    IMFSourceReader** ppReader)
{
    IMFAttributes* pAttributes = nullptr;
    HRESULT hr = MFCreateAttributes(&pAttributes, 1);
    if (FAILED(hr))
    {
        return hr;
    }

    // Supplying a callback here is what selects asynchronous mode.
    hr = pAttributes->SetUnknown(MF_SOURCE_READER_ASYNC_CALLBACK, pCallback);
    if (SUCCEEDED(hr))
    {
        hr = MFCreateSourceReaderFromURL(path, pAttributes, ppReader);
    }

    SafeRelease(&pAttributes);
    return hr;
}
Here, asynchronous mode is selected by placing the callback into the attribute set before creating the reader.
That means:
- the callback itself is a COM interface
- async mode is configured through IMFAttributes
- the mode is decided at creation time
In practice, it is important that the IMFSourceReaderCallback implementation be thread-safe and not directly grab UI objects.
7.4. Enumerating and activating MFTs with MFTEnumEx
HRESULT FindH264Decoder(IMFTransform** ppTransform)
{
    if (ppTransform == nullptr)
    {
        return E_POINTER;
    }
    *ppTransform = nullptr;

    IMFActivate** ppActivate = nullptr;
    UINT32 count = 0;

    // Ask only for decoders that accept H.264 video as input.
    MFT_REGISTER_TYPE_INFO inputType = {};
    inputType.guidMajorType = MFMediaType_Video;
    inputType.guidSubtype = MFVideoFormat_H264;

    HRESULT hr = MFTEnumEx(
        MFT_CATEGORY_VIDEO_DECODER,
        MFT_ENUM_FLAG_SYNCMFT | MFT_ENUM_FLAG_LOCALMFT,
        &inputType,
        nullptr,
        &ppActivate,
        &count);
    if (SUCCEEDED(hr) && count == 0)
    {
        // Enumeration can succeed without finding any match.
        hr = MF_E_TOPO_CODEC_NOT_FOUND;
    }
    if (SUCCEEDED(hr))
    {
        // Only now is the real transform created, from the first candidate.
        hr = ppActivate[0]->ActivateObject(IID_PPV_ARGS(ppTransform));
    }

    // Release every activation object, then the array itself.
    for (UINT32 i = 0; i < count; ++i)
    {
        SafeRelease(&ppActivate[i]);
    }
    CoTaskMemFree(ppActivate);
    return hr;
}
What matters here is that enumeration does not necessarily hand you the final transform directly. You may get activation objects first, inspect them, and only then instantiate the actual transform.
That is a very characteristic part of Media Foundation’s COM-shaped design.
8. Summary
Media Foundation is not “just COM,” but it absolutely uses COM to define many of its most important boundaries.
If you understand:
- why CoInitializeEx and MFStartup sit side by side
- why so many core objects are interface-based
- why activation objects, GUID attributes, callbacks, and apartment concerns appear naturally
then the API becomes much easier to read.
The simplest practical rule is:
- start with Source Reader / Sink Writer if you can
- move to Media Session when playback orchestration matters
- move into MFT only when your real problem is custom transform integration
That keeps the amount of conceptual baggage much more manageable.
9. References
Author GitHub
The author of this article, Go Komura, is on GitHub as gomurin0428.
You can also find COM_BLAS and COM_BigDecimal there.