(old) FPGA Reference Design
The NDI Advanced SDK contains working example designs intended to assist in using the NDI Advanced software SDK and the NDI FPGA IP cores.
Building a working embedded NDI design requires wide-ranging expertise and these example designs are provided in an attempt to make this process more accessible. Prebuilt uSD card images are available for the supported platforms and details are provided regarding how to rebuild each required piece from scratch.
Quick Start
Obtain one of the supported development boards:
Digilent Arty-Z7-20 (partial support, the Zybo-Z7-20 is preferred)
Boot your board using a prebuilt image
See the FPGA Quick Start Guide.pdf file for details on obtaining the uSD image, writing it to a uSD card, and using it with one of the supported platforms.
Directory structure:
The FPGA reference design is organized into the following files and subdirectories under the fpga_reference_design/ directory:
README.md
High level overview of the provided example projects
README.uSD.md
Details on using the prebuilt uSD images
CHANGELOG.md
List of notable changes
src/cpp
Software source code and example projects
src/fpga
Hardware design files and example projects
linux_kernel
Projects to build kernel and boot loader
os_uSD
Scripts to generate root filesystem and bootable uSD images
Design Overview
There are a lot of different pieces required to create a working example system:
FPGA hardware design
Boot loader
Linux kernel
Device-tree
Linux root file system
Application software
Bootable image including all of the above
While each of these components is important and necessary, not all of them have to be customized in order to create a custom project. The three components that need to be customized for almost any project are the actual FPGA hardware, the application software, and the device tree. The Linux kernel and boot loader can typically be used without customizations and there are several options for creating a Linux root file system.
This example design uses vendor specific processes to generate the boot loader, the Linux kernel, and most of the device tree. The root file system is a standard Debian install. The FPGA logic and application software are custom, and some device-tree customizations are required to support the custom FPGA hardware. Each of these components is discussed in more detail below.
Theory of Operation
Embedded systems require interaction between hardware and software. This is a complex process even on small micro-controllers, and is even more so on systems running a high-level OS such as Linux. This example design was created with the intent to make the interaction between hardware and software as simple as possible to implement and modify, with no need to write kernel-space code.
Common Encode and Decode Base Functionality
Low Level Details and Device-Tree
In embedded systems, it is necessary to communicate details of the system across several fundamentally different environments (e.g., hardware, the Linux kernel, and application software). Rather than using lots of "magic" numbers stashed in cryptic header files, this is now mostly done at the system level with device-tree. Details provided via device-tree can control which kernel modules get loaded (or do not), as well as modify the behavior of those modules.
Hardware IP
Vendor IP
Standard vendor IP blocks for GPIO and I2C are implemented along with the Hard IP ARM cores in the example design files. These vendor IP blocks are added to the device-tree and are controlled by standard Linux kernel drivers. This allows the pushbutton switches and LEDs connected to the FPGA fabric to be made available to the Linux kernel as part of the standard gpio subsystem.
Xilinx HDMI IP
The ZCU104 uses the Xilinx HDMI IP core; however, the Linux drivers for this core are disabled since the NDI encoding data flow operates outside of the Linux kernel's standard video input and output systems. The control software for the HDMI core instead operates on one of the Cortex-R5 cores, which means all the hardware used by the HDMI core (uart1, i2c0, i2c1) must be disabled in Linux to avoid contention.
Altera HDMI IP
The Altera examples use the Altera HDMI IP cores. Management of these IP cores is done outside of Linux using the NIOS CPU core instantiated as part of the Altera HDMI example design. This code is used as-is for HDMI Rx, and modified to allow configuration of the HDMI Tx logic to support several different resolutions for the NDI Decode example design.
Custom IP
The NDI hardware implements a 4K block of register space which is manually added to the device-tree along with IRQ settings and other hardware details. The generic-uio driver is used to map the hardware register space and IRQs so they are available to user-space code.
Reserved Memory Regions
This design requires two large blocks of buffer memory, one for compressed video and raw audio data that must be visible to Linux, and one for raw video data which can be visible to Linux but may also be implemented as a separate memory bank for improved system performance.
NewTek_Reserved
This memory region is used to store compressed video data and raw audio data. This region is marked cacheable for performance. Cache coherency is maintained since the hardware accesses this memory region using a cache coherent bus interface.
NewTek_Video
This memory region is used to store raw (uncompressed) video data and thus requires high bandwidth. The application does not access this memory in normal operation so this region may be implemented as a dedicated FPGA-side memory bank to improve performance. This region will only appear in the device-tree if implemented using shared memory visible to Linux. If a live video input is not available and this memory region is visible to Linux, a video pattern can be written to allow testing of the compression core and software application.
Software
Standard Vendor IP Blocks
The standard Vendor IP blocks (for I2C and GPIO) are added to the device tree and stock Linux kernel drivers are used to communicate with this hardware.
Custom IP blocks
The device-tree entries for the custom FPGA IP use the generic-uio kernel driver. This driver makes the memory regions and IRQs specified by the device-tree entries accessible to user-space code. The libuio library is used to facilitate accessing the hardware.
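As a rough illustration of what a UIO mapping gives user-space code, the sketch below uses the raw kernel UIO interface (open, mmap, and a blocking read on the device node) rather than libuio, whose API is not shown here. The class name UioRegisters, the register offsets, and the device path are hypothetical and are not taken from the reference design sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical RAII wrapper around one UIO device node such as /dev/uio0.
// Uses the raw kernel UIO interface, not libuio.
class UioRegisters {
public:
    UioRegisters(const std::string& path, std::size_t map_bytes) : bytes_(map_bytes) {
        fd_ = ::open(path.c_str(), O_RDWR | O_SYNC);
        if (fd_ < 0) throw std::runtime_error("cannot open " + path);
        // UIO selects mapping N with an mmap offset of N * page size; 0 is the first map.
        void* p = ::mmap(nullptr, bytes_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("cannot mmap " + path);
        }
        regs_ = static_cast<volatile std::uint32_t*>(p);
    }
    ~UioRegisters() {
        ::munmap(const_cast<std::uint32_t*>(regs_), bytes_);
        ::close(fd_);
    }
    UioRegisters(const UioRegisters&) = delete;
    UioRegisters& operator=(const UioRegisters&) = delete;

    std::uint32_t read_reg(std::size_t byte_offset) const {
        assert(byte_offset % 4 == 0 && byte_offset + 4 <= bytes_);
        return regs_[byte_offset / 4];
    }
    void write_reg(std::size_t byte_offset, std::uint32_t value) {
        assert(byte_offset % 4 == 0 && byte_offset + 4 <= bytes_);
        regs_[byte_offset / 4] = value;
    }

    // Blocks until the next interrupt and returns the kernel's running interrupt count.
    std::uint32_t wait_irq() {
        std::uint32_t enable = 1;  // writing 1 re-enables the IRQ with uio_pdrv_genirq
        if (::write(fd_, &enable, sizeof enable) != static_cast<ssize_t>(sizeof enable))
            throw std::runtime_error("cannot re-enable interrupt");
        std::uint32_t count = 0;
        if (::read(fd_, &count, sizeof count) != static_cast<ssize_t>(sizeof count))
            throw std::runtime_error("interrupt wait failed");
        return count;
    }

private:
    int fd_ = -1;
    std::size_t bytes_;
    volatile std::uint32_t* regs_ = nullptr;
};
```

On a target board this might be used as `UioRegisters regs("/dev/uio0", 4096);` followed by register reads and writes and a call to `wait_irq()` from a worker thread. The device index depends on the order in which the kernel probes the device-tree entries.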
The application also reads details regarding the reserved memory regions directly from the device-tree, as this functionality is not supported by libuio.
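Device-tree properties are exposed by the kernel as files (for example under /proc/device-tree/) and store integers as big-endian 32-bit cells. The following is a minimal sketch of decoding a reserved-memory reg property into a base address and size. It assumes two address cells and two size cells by default; the MemRegion type and parse_reg() helper are hypothetical and are not the functions used in src/cpp/ndi_common/hardware.*.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical description of one reserved memory region.
struct MemRegion {
    std::uint64_t base;
    std::uint64_t size;
};

// Reads `cells` big-endian 32-bit cells starting at `p` into one integer.
static std::uint64_t read_cells(const std::uint8_t* p, unsigned cells) {
    std::uint64_t value = 0;
    for (unsigned i = 0; i < cells * 4; ++i) {
        value = (value << 8) | p[i];
    }
    return value;
}

// Decodes the first <address, size> entry of a raw `reg` property.
// Assumption: #address-cells and #size-cells are each 1 or 2 (default 2).
MemRegion parse_reg(const std::vector<std::uint8_t>& raw,
                    unsigned addr_cells = 2, unsigned size_cells = 2) {
    assert(addr_cells >= 1 && addr_cells <= 2);
    assert(size_cells >= 1 && size_cells <= 2);
    const std::size_t needed = (addr_cells + size_cells) * 4u;
    if (raw.size() < needed) {
        throw std::runtime_error("reg property is too short");
    }
    return MemRegion{read_cells(raw.data(), addr_cells),
                     read_cells(raw.data() + addr_cells * 4u, size_cells)};
}
```

The raw bytes would typically be read from the node's reg file in binary mode; the resulting base and size can then be passed to mmap() on a suitable memory device.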
Initial Startup
<device-tree>
Contains memory addresses and IRQ details
src/cpp/ndi_common/hardware.*
Low-level access to register space and IRQs
src/cpp/ndi_common/device.*
Machine specific details
src/fpga/ip/common/Version.vhd
Version information compiled into the hardware
src/cpp/ndi_encode/ndi_encode.*
Top level application file (Encode)
src/cpp/ndi_decode/ndi_decode.*
Top level application file (Decode)
When the application is launched, a substantial amount of setup is performed before ongoing operation is handed off to a number of worker threads. Most of this setup happens when the hardware class is initialized. After the software processes any command-line options, an instance of the hardware class is created and initialized.
The hardware class uses libuio to access the memory regions and interrupts specified in the device-tree for the custom hardware IP. In addition, details for the reserved memory regions are read from the device-tree and the memory is mmap()'d to make it available for access. At this point, version information is read from the hardware and a device class specific to the hardware platform is created. This allows some behaviors to change between different physical boards (currently used to control any required ACP address translation). Finally, details regarding the UIO devices and reserved memory are printed to the console, and the reserved memory region for compressed video is written with the value 0xdeaddead, making it easier to identify uninitialized memory locations when debugging.
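The 0xdeaddead fill described above can be sketched as follows. The helper names poison_fill() and count_untouched() are hypothetical; the second helper only illustrates how the marker makes unwritten locations easy to spot in a debugger or a memory dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPoison = 0xdeaddead;

// Fills a buffer with the marker value so that unwritten words stand out later.
void poison_fill(std::uint32_t* buf, std::size_t words) {
    assert(buf != nullptr || words == 0);
    for (std::size_t i = 0; i < words; ++i) {
        buf[i] = kPoison;
    }
}

// Counts the words that still hold the marker, i.e. were probably never written.
std::size_t count_untouched(const std::uint32_t* buf, std::size_t words) {
    assert(buf != nullptr || words == 0);
    std::size_t untouched = 0;
    for (std::size_t i = 0; i < words; ++i) {
        if (buf[i] == kPoison) ++untouched;
    }
    return untouched;
}
```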
NDI Encode Operation
Video Input
src/cpp/ndi_encode/video_capture.*
video_capture:: class
src/cpp/ndi_encode/track.*
track:: class
src/fpga/ip/common/Vid_Track.vhd
video format tracking hardware
src/fpga/ip/common/Vid_In.vhd
Video input DMA logic
src/fpga/ip/common/Preview.vhd
Creates low-resolution preview stream
Filename varies
Low level video input logic
Video input starts with the low level video input logic. This logic is platform specific:
a10socdk
Rx
Altera HDMI RxTx example design
Agilex-7
Rx
Altera HDMI RxTx example design
ZCU104
Rx
Xilinx HDMI Rx example design
Zybo-Z7-20
Rx
Open-source HDMI Rx logic
The software for controlling the low-level video input logic is currently outside the scope of this example. The Zybo-Z7 HDMI input requires no control software, the Xilinx HDMI control software is running bare-metal on one of the Cortex-R5 cores, and the Altera HDMI control software runs on a dedicated NIOS core. See the "HDMI Input Logic" section under "FPGA Projects" for additional details.
There is, however, software support for low-level input monitoring code to send details extracted by the track:: class (locked/unlocked, resolution, etc) to the video_capture:: class. See video_capture::set_signal() and the do_commands() function in ndi_send.cpp for details.
The provided video input logic is intended as an easy to understand minimally functional example and is not recommended for use in production. In particular, the input logic does not gracefully handle corrupt video data, unexpected cable removal, and similar signal quality issues.
It is expected the user of this SDK will have their own custom video I/O logic specific to their target hardware. If this is not the case and a production capable video I/O solution is desired, there are numerous IP cores available from Xilinx, Altera, and third parties capable of supporting all major video interfaces (SDI, HDMI, DisplayPort, CSI/DSI, etc.).
Once the low-level video signal is received by the FPGA hardware, the signal is synchronized to the main clock and converted to a standard interface (HDMI_T). This signal is sent to the format tracking logic (Vid_Track.vhd) and is filtered (Preview.vhd) to create a low resolution preview stream. The resulting full and preview resolution streams are each sent to a bus mastering DMA engine (Vid_In.vhd) which assembles pixels into words and writes the raw video data into memory.
The track:: class monitors the video format tracking hardware to detect the acquisition or loss of signal lock. When either event happens, appropriate details are communicated to the video_capture:: class to start or stop video streaming.
The video_capture:: class manages the recording of video data into the reserved memory buffer as well as hardware settings controlling things like the data format and preview decimation ratio. Since there are two input streams, the video_capture:: class operates on frame pairs. Each pair always contains a full resolution frame and may or may not contain a preview resolution frame. The maximum preview framerate is 30 fps, so if the full framerate is greater than 30 fps, some full frames will not have a corresponding preview frame. For simplicity, this is represented as a frame pair with an invalid preview pointer (NULL).
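To make the frame-pair rule concrete, here is a small sketch of one way to decide which full frames also carry a preview frame when the preview rate is capped at 30 fps. The PreviewDecimator class is hypothetical and works on integer frame rates only; in the reference design the decimation is configured in hardware and may behave differently.

```cpp
#include <algorithm>
#include <cassert>

constexpr unsigned kMaxPreviewFps = 30;

// Hypothetical helper: spreads at most 30 preview frames evenly over each
// second of full-rate video using an integer accumulator.
class PreviewDecimator {
public:
    explicit PreviewDecimator(unsigned full_fps)
        : full_fps_(full_fps), preview_fps_(std::min(full_fps, kMaxPreviewFps)) {
        assert(full_fps > 0);
    }

    // Call once per full frame; returns true if this pair should hold a preview frame.
    bool wants_preview() {
        acc_ += preview_fps_;
        if (acc_ >= full_fps_) {
            acc_ -= full_fps_;
            return true;
        }
        return false;  // the pair would carry a NULL preview pointer
    }

private:
    unsigned full_fps_;
    unsigned preview_fps_;
    unsigned acc_ = 0;
};

// Counts how many of `frames` consecutive full frames receive a preview frame.
unsigned count_previews(unsigned full_fps, unsigned frames) {
    PreviewDecimator decimator(full_fps);
    unsigned previews = 0;
    for (unsigned i = 0; i < frames; ++i) {
        if (decimator.wants_preview()) ++previews;
    }
    return previews;
}
```

With a 60 fps input every second frame pair has a preview frame, while at 30 fps or below every pair has one.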
The video_capture::capture_frames thread loops through the following process to capture frame pairs:
Create a new frame pair
Allocate memory from the reserved buffer
Fill in frame details
Queue the frame (send details to the hardware)
Post the frame (tell hardware the new frame data is valid)
Wait for an interrupt indicating the frame has been captured
Send the frame to the compression logic
Once a frame pair is captured, it is placed on a work queue for the video compression engine. If this queue is too full, the oldest pending frame pair is dropped and a warning is printed.
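The drop-oldest behavior of the work queue can be sketched with a small bounded queue. DropOldestQueue is a hypothetical name and this is not the queue type used in the example sources; a real implementation would also need a condition variable so the consumer thread can block while the queue is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical bounded work queue: when full, the oldest pending item is
// dropped to make room for the newest one.
template <typename T>
class DropOldestQueue {
public:
    explicit DropOldestQueue(std::size_t max_depth) : max_depth_(max_depth) {
        assert(max_depth > 0);
    }

    // Adds an item. Returns the dropped item if the queue was already full,
    // so the caller can release its buffers and print a warning.
    std::optional<T> push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::optional<T> dropped;
        if (queue_.size() >= max_depth_) {
            dropped = std::move(queue_.front());
            queue_.pop_front();
        }
        queue_.push_back(std::move(item));
        return dropped;
    }

    // Removes and returns the oldest item, or an empty optional if none is pending.
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return std::nullopt;
        T item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }

private:
    std::size_t max_depth_;
    std::mutex mutex_;
    std::deque<T> queue_;
};
```

Returning the dropped item rather than discarding it silently lets the producer return the frame memory to the reserved buffer.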
NOTE: The video input logic assumes a "clean" video signal and is an example design intended to be simple and easy to understand. This logic does not attempt to deal with real world problems like random noise on an unterminated SDI input or when users plug/unplug the cable while the system is running. An actual production device should be more tolerant of signal errors.
Video Pattern
src/cpp/ndi_encode/video_pattern.*
video_pattern:: class
The video_pattern:: class is a sub-class of the video_capture:: class that does not rely on hardware to generate a video stream. Rather than capturing frames from hardware, this class generates a video pattern any time the signal format changes (it is possible to manually change the resolution at run-time from the console). The video pattern generated is twice as tall as the video format (see the m_scroll_dist member variable), and a preview resolution version of the pattern is also created (since both full and preview resolution streams are required). In normal operation, a frame-sized window of the over-sized video pattern is used to provide source video for the video_compress:: class, with the start point of the window moving down one line with each new frame, providing the illusion of moving video.
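The sliding window over the double-height pattern can be sketched as a simple offset calculation. This assumes the window wraps back to the top after one frame height of scrolling (so that it always fits inside the pattern); the PatternWindow type and window_for_frame() helper are hypothetical, and the actual wrap distance is governed by m_scroll_dist in the example sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of where the current frame starts in the pattern.
struct PatternWindow {
    std::size_t start_line;   // first pattern line shown in this frame
    std::size_t byte_offset;  // offset of that line from the pattern base
};

// The pattern is 2 * frame_height lines tall, so any start line in
// [0, frame_height) leaves a full frame of lines below it.
PatternWindow window_for_frame(std::uint64_t frame_index,
                               std::size_t frame_height,
                               std::size_t stride_bytes) {
    assert(frame_height > 0);
    const std::size_t start = static_cast<std::size_t>(frame_index % frame_height);
    return PatternWindow{start, start * stride_bytes};
}
```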
Video Compression
src/cpp/ndi_encode/video_compress.*
video_compress:: class
src/fpga/NDI_Enc/Encode_x4.vhd
4-core NDI Hardware Encoder
src/fpga/NDI_Enc/Encode_Xilinx.vhdp
Single NDI Hardware Encoder core (Xilinx)
src/fpga/NDI_Enc/Encode_Altera.vhd
Single NDI Hardware Encoder core (Altera)
One, two, or four copies of the single NDI Encoder core are instantiated in hardware (Encode_x4.vhd). The cores operate in parallel, with each core operating on one or more quarters of the video frame known as a "slice". This increases the effective throughput of the compression core and lowers system latency. The Encode_x4.vhd module also exposes several generics which can be used to configure various aspects of the encoder core.
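The relationship between slices and cores can be illustrated with a little arithmetic. The sketch below assumes that slices are equal horizontal bands of the frame and that each core handles a contiguous run of slices; both are assumptions made for illustration, and the helper names are hypothetical. Only the counts (four slices, and one, two, or four cores) come from the description above.

```cpp
#include <cassert>

constexpr unsigned kSlicesPerFrame = 4;  // slices are quarters of the frame

struct SliceLines {
    unsigned first_line;
    unsigned line_count;
};

// Assumption: slices are equal horizontal bands; the last absorbs any remainder.
SliceLines slice_lines(unsigned frame_height, unsigned slice) {
    assert(slice < kSlicesPerFrame);
    const unsigned band = frame_height / kSlicesPerFrame;
    const unsigned first = slice * band;
    const unsigned count =
        (slice == kSlicesPerFrame - 1) ? frame_height - first : band;
    return SliceLines{first, count};
}

// Assumption: each core takes a contiguous run of 4 / cores slices.
unsigned core_for_slice(unsigned slice, unsigned cores) {
    assert(cores == 1 || cores == 2 || cores == 4);
    assert(slice < kSlicesPerFrame);
    return slice / (kSlicesPerFrame / cores);
}
```

With four cores each slice gets its own core; with one core the same core processes all four slices in turn.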
The video_compress:: class monitors a work queue filled with raw video frames by the video_capture:: class (above). When a new frame pair is received the full resolution frame is sent to hardware for processing. Once the hardware generates an interrupt indicating processing is finished, the software performs minimal post-processing on the compression thread. Slice lengths are read from the hardware and written to memory by fixup_frame(), the resulting frame length is used to update the quality setting, then the frame is passed to the NDI send threads via add_frame_ndi().
The send threads (send_full, send_prvw) monitor independent work queues for the full and preview resolution video streams and pass new frames to the NDI stack. Sending a frame to the NDI stack is a synchronizing event which will block until the frame is fully processed, which is why there are independent threads for the two video streams. This gives each stream the maximum amount of time to send data to the listeners.
NDI Transmission
src/cpp/ndi_encode/video_compress.*
video_compress:: class
src/cpp/ndi_encode/network_send_video.cpp
network_send:: video support
The send_full() and send_prvw() threads in the video_compress:: class monitor independent work queues for the full and preview resolution video streams and pass new frames to network_send::add_frame(). This function is basically a shim which converts the video_compress::frame_t type into an NDIlib_video_frame_v2_t type and sends it to the NDI stack. Sending a frame to the NDI stack is a synchronizing event which will block until the frame is fully processed, which is why there are independent threads for the two video streams. This gives each stream the maximum amount of time to send data to the listeners.
Audio Input
src/cpp/ndi_encode/audio_capture.*
audio_capture:: class
src/fpga/ip/common/Aud_In.vhd
Audio input DMA logic
Audio input starts with the low level audio input logic. This logic supports standard I2S serial audio streams. Support for parallel audio data extracted from the HDMI streams will be added soon. Once assembled into complete audio samples, the data is queued in a FIFO allowing burst writes to system memory.
The audio_capture:: class manages the recording of audio data into the reserved memory buffer as well as hardware settings controlling things like the data format (typically left-justified).
The audio_capture::capture_frames thread loops through a process very similar to the video_capture thread. One major difference is the audio logic currently supports "overlapped" commands, meaning the next command is written to hardware while the current command is still in progress:
Create a new frame
Allocate memory from the reserved buffer
Fill in frame details
Queue the frame (send details to the hardware)
Post the frame (tell hardware the new frame data is valid)
Loop until signaled to exit
Create a new frame
Allocate memory from the reserved buffer
Fill in frame details
Queue the frame (send details to the hardware)
Post the frame (tell hardware the new frame data is valid)
Wait for an interrupt indicating the frame has been captured
Send the frame to the compression logic
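The overlapped command flow listed above can be sketched with a stand-in for the hardware. FakeAudioHw and capture_overlapped() are hypothetical; the fake completes commands in FIFO order and replaces the interrupt wait with a simple pop, so only the ordering of the steps is illustrated.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for the audio DMA hardware.
struct FakeAudioHw {
    std::deque<int> pending;        // frame ids queued and posted, not yet captured
    std::size_t max_in_flight = 0;  // deepest command overlap observed

    void queue_and_post(int frame_id) {
        pending.push_back(frame_id);
        if (pending.size() > max_in_flight) max_in_flight = pending.size();
    }

    // Stands in for blocking on the capture-complete interrupt.
    int wait_for_capture() {
        assert(!pending.empty());
        const int id = pending.front();
        pending.pop_front();
        return id;
    }
};

// Runs the overlapped loop for `iterations` passes and returns the frame ids
// in the order they would be handed to the next stage.
std::vector<int> capture_overlapped(FakeAudioHw& hw, int iterations) {
    std::vector<int> captured;
    int next_id = 0;
    hw.queue_and_post(next_id++);      // prime the hardware with the first command
    for (int i = 0; i < iterations; ++i) {
        hw.queue_and_post(next_id++);  // written while the current command is in progress
        captured.push_back(hw.wait_for_capture());  // then wait for the current one
    }
    return captured;
}
```

Because the next command is always written before waiting, the hardware never sits idle between frames, at the cost of one extra buffer in flight.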
Audio Compression
src/cpp/ndi_encode/audio_compress.*
audio_compress:: class
This class simply passes captured audio frames from the audio_capture:: class to the NDI stack. It exists primarily to match the video path data flow, to allow any processing of the audio samples that might be required (e.g., masking lower-order bits that might contain AES packet data), and because when this code was initially written the NDI API did not support 32-bit signed audio samples.
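As an example of the kind of per-sample processing mentioned above, the sketch below clears the low-order bits of 32-bit signed samples. The helper name and the choice of how many bits to clear are hypothetical; which bits, if any, carry non-audio data depends on the audio source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Clears the lowest `low_bits` bits of every sample in place.
void mask_low_bits(std::vector<std::int32_t>& samples, unsigned low_bits) {
    assert(low_bits < 32);
    const std::uint32_t mask = ~((std::uint32_t{1} << low_bits) - 1u);
    for (std::int32_t& s : samples) {
        // Work on the unsigned bit pattern to avoid signed bitwise surprises.
        s = static_cast<std::int32_t>(static_cast<std::uint32_t>(s) & mask);
    }
}
```

For 24-bit audio carried in the upper bits of a 32-bit word, `mask_low_bits(samples, 8)` would leave the audio untouched and zero the remaining byte.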
In future versions, expect the native NDI types to be used and the audio_compress class to be removed.
Audio Transmission
src/cpp/ndi_encode/network_send_audio.cpp
network_send:: audio support
This function is basically a shim which converts the audio_capture::frame_t type into an NDIlib_audio_frame_interleaved_32s_t type and sends it to the NDI stack.
Expect this code to be deprecated in future versions and the sending logic to be migrated to the audio_capture:: class.
Tally Operation
src/cpp/ndi_encode/tally.*
tally:: class
Tally outputs are implemented using the Linux kernel LED class. Tally LED entries are created in the device-tree, and the application uses the resulting sysfs entries to control the LED behavior.
The tally class initializes all LEDs to a known state on construction, then the tally process simply loops, checking for updates from the NDI stack. If new tally data is available, the tally LEDs are updated.
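The Linux LED class exposes each LED as a directory under /sys/class/leds/ containing a brightness file. A minimal sketch of driving one LED through that interface is shown below. The set_led() helper and the LED name used in the usage note are hypothetical, since the actual names come from the device-tree entries; the root directory is a parameter so the function can be exercised without real hardware.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Writes 1 or 0 to <leds_root>/<led_name>/brightness.
// Returns false if the LED does not exist or the write fails.
bool set_led(const std::filesystem::path& leds_root,
             const std::string& led_name, bool on) {
    assert(!led_name.empty());
    std::ofstream file(leds_root / led_name / "brightness");
    if (!file) return false;
    file << (on ? "1" : "0") << '\n';
    file.flush();
    return file.good();
}
```

On a target board this might be called as `set_led("/sys/class/leds", "tally_pgm", true);`, where tally_pgm is a placeholder for whatever label the device-tree assigns.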
NDI Decode Operation
Video Output
src/cpp/ndi_decode/video_playback.*
video_playback:: class
src/fpga/ip/common/Test_Pattern_Gen_YUV444
Video timing & pattern generator
src/fpga/ip/common/Vid_Out.vhd
Video output DMA logic
Filename varies
Low level video output logic
Video output timing is driven by the low level video logic. This logic is platform specific:
a10socdk
Tx
Altera HDMI RxTx example design modified for Tx only
ZCU104
Tx
Xilinx HDMI Tx example design
Zybo-Z7-20
Tx
Open-source DVI Tx logic
The software for controlling the low-level video output logic is currently outside the scope of this example. The Zybo-Z7 HDMI output requires no control software, the Xilinx HDMI control software is running bare-metal on one of the Cortex-R5 cores, and the Altera HDMI control software runs on a dedicated NIOS core (see the README file accompanying the hardware project files for details).
The video output logic generates a pixel stream with programmable timings based on the video clock, which is programmable on most platforms to support multiple resolutions. The video pixel data can be a hardware generated test pattern or the video data can be filled in with data from memory for video playback. The vertical sync interrupt from Vid_Out.vhd is used to drive all output operations, and an NDI FrameSync instance is used to synchronize the received NDI video stream with the hardware output timing.
Video Decompression
src/cpp/ndi_decode/video_decode.*
video_decode:: class
src/fpga/NDI_Dec/Decode_x4.vhd
4-core NDI Hardware Decoder
src/fpga/NDI_Dec/Decode_Xilinx.vhdp
Single NDI Hardware Decoder core (Xilinx)
src/fpga/NDI_Dec/Decode_Altera.vhd
Single NDI Hardware Decoder core (Altera)
One, two, or four copies of the NDI Decoder core are instantiated in hardware (Decode_x4.vhd). The cores operate in parallel, with each core operating on one or more quarters of the video frame known as a "slice". This increases the effective throughput of the decompression core and lowers system latency. The Decode_x4.vhd module also exposes several generics which can be used to configure various aspects of the decoder core.
The video_decode:: class monitors a work queue filled with NDI video frames by the video_playback:: class (above). When a new frame is received it is sent to hardware for processing. Once the hardware generates an interrupt indicating processing is finished, the software queues the decoded frame for display via video_playback::send_frame().
Video output process flow:
video_playback::pull_frames thread
Wait for vertical sync interrupt: video_playback::pull_frames
Pull a frame from the NDI FrameSync Instance: network_recv::get_video()
Copy video frame to reserved memory region visible by hardware: video_decode::add_frame()
Push copied frame onto NDI Decode queue: video_decode::add_frame()
Free video frame acquired from the NDI FrameSync Instance
video_decode::decode_frames thread
Wait for new frame to decode
Setup hardware to decode NDI frame into raw video data: video_decode::decode_frame()
Start hardware decoding: video_decode::post_frame()
Wait for interrupt
Queue the raw frame for display: video_playback::send_frame()
The frame will display after the next vertical sync
Audio Output
src/cpp/ndi_decode/audio_playback.*
audio_playback:: class
src/fpga/ip/common/Aud_Out.vhd
Audio output DMA logic
Audio input starts with the low level audio input logic. This logic supports standard I2S serial audio streams. Support for parallel audio data extracted from the HDMI streams will be added soon. Once assembled into complete audio samples, the data is queued in a FIFO allowing burst writes to system memory.
The audio_playback:: class manages the playback of audio data from the reserved memory buffer as well as hardware settings controlling things like the data format (typically left-justified).
The audio_playback::pull_frames thread loops through a process very similar to the video_playback thread. An interrupt is generated when a programmed audio DMA completes which drives the timing of the rest of the audio output operations.
Audio output process flow
audio_playback::pull_frames thread
Wait for audio interrupt: audio_playback::pull_frames
Pull a frame from the NDI FrameSync Instance: network_recv::get_audio()
Convert audio frame to 32-bit data and copy to reserved memory region visible by hardware: audio_playback::queue_frame()
Free audio frame acquired from the NDI FrameSync Instance
Enable playback of audio data: audio_playback::post_frame()
Supported Video Formats
The NDI Encode and Decode FPGA cores support a variety of standard and custom in-memory video formats for 4:2:2 and 4:2:0 YUV video. Planar alpha is also supported for 4:2:2 formats. For further details, see the generate_video() routine in src/cpp/ndi_encode/video_pattern.cpp.
Support for semi-planar formats and planar alpha can be
UYVY
4:2:2 Packed 8-bit
YUYV
4:2:2 Packed 8-bit, Y first
NV16
4:2:2 Semi-planar 8-bit - luma then packed chroma
UYVW
4:2:2 Packed 16-bit (custom format) - 16-bit version of UYVY
Y216
4:2:2 Packed 16-bit, Y first - 16-bit version of YUYV
P216
4:2:2 Semi-planar 16-bit - luma then packed chroma
420
4:2:0 Packed 8-bit (custom format) - Macroblock interleaved, 16 pixels of Y followed by 8 pixels of U (even lines) or V (odd lines)
NV12
4:2:0 Semi-planar 8-bit - Luma then packed chroma
420W
4:2:0 Packed 16-bit (custom format) - Macroblock interleaved, 16 pixels of Y followed by 8 pixels of U (even lines) or V (odd lines)
P016
4:2:0 Semi-planar 16-bit - Luma then packed chroma
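The buffer size needed for one frame follows from the bits per pixel of each format listed above. The sketch below is a rough calculation that ignores any line stride padding and any planar alpha data; the function names are hypothetical, and the values for the custom formats (UYVW, 420, 420W) are derived from their descriptions rather than from the example sources.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Average bits per pixel (luma plus chroma, no alpha) for each listed format.
std::size_t bits_per_pixel(const std::string& format) {
    if (format == "UYVY" || format == "YUYV" || format == "NV16") return 16;  // 4:2:2, 8-bit
    if (format == "UYVW" || format == "Y216" || format == "P216") return 32;  // 4:2:2, 16-bit
    if (format == "420" || format == "NV12") return 12;                       // 4:2:0, 8-bit
    if (format == "420W" || format == "P016") return 24;                      // 4:2:0, 16-bit
    throw std::invalid_argument("unknown format: " + format);
}

// Bytes needed for one frame, assuming tightly packed lines and even dimensions.
std::size_t frame_bytes(const std::string& format,
                        std::size_t width, std::size_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * bits_per_pixel(format) / 8;
}
```

For example, a 1920x1080 UYVY frame needs 4,147,200 bytes, while the same frame in NV12 needs 3,110,400 bytes.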
Rebuilding from source
The FPGA reference design is broken into four major components, each with its own subdirectory under the fpga_reference_design/ directory:
src/fpga/
FPGA Project Files
linux_kernel/
Linux kernel, U-Boot, and device-tree
src/cpp/
C++ Application Project Files
os_uSD/
Debian rootfs and uSD images
FPGA hardware design
The fpga_reference_design/src/fpga/ directory contains projects to build a bit-file for each supported platform. Refer to the "FPGA Projects" section for full details on building the hardware projects.
Boot loader, Linux kernel, and device-tree
The fpga_reference_design/linux_kernel/ directory contains projects to build a boot-loader, kernel, and device-tree for the supported platforms. The device-tree customizations required for the NDI FPGA hardware are also included in the example projects. Refer to the "Linux Kernel and Bootloader" section for full details.
Petalinux root file system
Application software
The fpga_reference_design/src/cpp/ directory contains example encode and decode software applications which use the FPGA based NDI logic to (de)compress live video data, and the Advanced NDI SDK to send and receive that data as an NDI stream. Refer to the "C++ Application Code" section for details.
Bootable uSD image
The fpga_reference_design/os_uSD/ directory contains scripts to create a Debian root filesystem as well as a bootable uSD card image. The resulting images contain all prerequisites needed to perform a self-hosted build of the example NDI application (i.e., built on the ARM platform and not cross-compiled). Refer to the "uSD Image Builder" section for full details.