Welcome
TAPA is a task-parallel HLS framework that compiles C++ dataflow programs to Verilog RTL for Xilinx FPGAs, with software simulation requiring no FPGA hardware.
C++ source → tapa compile → RTL (.xo) → Vitis v++ → FPGA bitstream
Choose your path
- New to FPGA? → Your First Run
- Migrating from Vitis HLS? → Lab 3: Migrating from Vitis HLS
- Already know FPGA? → How-To Guides (start with software simulation)
Most common tasks
Next step: Installation
Installation
Purpose: Install TAPA on your development machine.
When to use this: Setting up TAPA for the first time.
What you need
| Dependency | Version | Notes |
|---|---|---|
| GNU C++ Compiler (g++) | 7.5.0 or newer | Required for software simulation and deployment |
| Xilinx Vitis | 2022.1 or newer | Not needed for software simulation — only required for RTL synthesis and deployment |
TAPA has been tested on the following operating systems:
| OS | Minimum version | Notes |
|---|---|---|
| Ubuntu | 18.04 | |
| Debian | 10 | |
| Red Hat Enterprise Linux | 9 | Derivatives (AlmaLinux 9+, Rocky Linux 9+) also supported |
| Amazon Linux | 2023 | |
| Fedora | 34 | Fedora 39+ may have minor issues due to C library changes and Vitis HLS incompatibility |
Install from release
curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q
This installs the current stable release (0.1.20260319). With root
privileges, TAPA installs to /opt/tapa with symlinks in /usr/local/bin.
Otherwise it installs to ~/.tapa and adds itself to your PATH via your
shell profile.
TAPA's internal toolchain is being incrementally refactored to Rust for improved
performance and reliability. During this transition, we recommend staying on the
stable release (0.1.20260319) for production workloads. To try the latest
(potentially unstable) release instead, pass --beta:
curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q --beta
To install a specific version:
curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh \
| TAPA_VERSION=0.1.20260319 sh -s -- -q
Releases are available at github.com/tuna/tapa/releases.
Install g++
Install g++ using the package manager for your OS.
Ubuntu / Debian
For Ubuntu 18.04 and newer, or Debian 10 and newer:
sudo apt-get install g++
RHEL / Amazon Linux
For Red Hat Enterprise Linux 9 and newer, derivatives like AlmaLinux 9 and newer and Rocky Linux 9 and newer, or Amazon Linux 2023:
sudo yum install gcc-c++ libxcrypt-compat
Fedora
For Fedora 34 and newer. Fedora 39 and newer may have minor issues due to system C library changes and Vitis HLS tool incompatibility.
sudo yum install gcc-c++ libxcrypt-compat
Verify installation
tapa --version
Building from source
For source builds (full toolchain requirements and build commands), see Building from Source.
If installation fails, see Common Errors for known issues.
Next step: Your First Run
Your First Run
Run your first TAPA software simulation without FPGA hardware.
When to use this
Use this guide when you are learning TAPA for the first time, or when you want to quickly verify a design's correctness without synthesizing RTL or running on physical hardware.
What you need
- TAPA installed — see Installation
- g++ 7.5.0 or newer (check with g++ --version)
- The vadd example files: vadd.cpp and vadd-host.cpp
Commands
Compile the kernel and host code together using the tapa g++ wrapper, then
run the resulting binary with no arguments to trigger software simulation:
tapa g++ -- vadd.cpp vadd-host.cpp -o vadd
./vadd
tapa g++ is a wrapper around the GNU C++ compiler that automatically
includes the necessary TAPA headers and libraries. It prints the underlying
g++ command it invokes for reference.
Both the kernel file (vadd.cpp) and the host file (vadd-host.cpp) must
be passed in the same command. The kernel file is used for software
simulation.
Expected output
I20000101 00:00:00.000000 0000000 task.h:66] running software simulation with TAPA library
kernel time: 1.19429 s
PASS!
What this proves
The PASS! line confirms the vector addition produced correct results. The
first line shows that TAPA executed the kernel on the CPU using its
coroutine-based software simulator — no FPGA or Xilinx tools were involved.
If something goes wrong
If the build fails, the binary hangs, or the output shows FAIL!, see
Your First Debug Cycle.
Next step
Your First Debug Cycle
Diagnose and fix failures in TAPA software simulation.
Prerequisites
- TAPA installed — see Installation
- A simulation binary built with tapa g++ (see Your First Run)
Symptom
The simulation hangs without producing output, crashes with an error, or
prints FAIL! instead of PASS!.
How to confirm: run single-threaded
By default TAPA runs each task in its own coroutine using a thread pool sized to the number of physical CPU cores. Reducing concurrency to one thread improves reproducibility and simplifies debugging:
TAPA_CONCURRENCY=1 ./vadd
If the hang disappears or a crash becomes reproducible, the problem is likely a race condition or a deadlock that only manifests under concurrent execution.
Fix patterns
Attach GDB
Software simulation runs as a normal CPU process, so a debugger works without any special setup:
gdb ./vadd
Set a breakpoint on any TAPA task function by name and run:
(gdb) b VecAdd
(gdb) run
You can set breakpoints on any leaf task (Add, Mmap2Stream, Stream2Mmap,
etc.) and step through the code exactly as you would for a regular C++
program.
Dump stream contents
Set TAPA_STREAM_LOG_DIR to a directory path before running. TAPA will write
one log file per named stream under that directory, recording every value
written to the stream:
TAPA_STREAM_LOG_DIR=/tmp/logs ./vadd
Log format:
- Primitive types (int, float, …) are written as decimal text, one value per line.
- Structs without operator<< are written as little-endian hex.
- Structs with operator<< are written using your operator.
After the run, inspect the files under /tmp/logs/ to trace data as it
flows through each stream and locate where incorrect values first appear.
Common mistakes to check
| Symptom | Likely cause | Fix |
|---|---|---|
| Hangs forever | Deadlock or backpressure — a stream is full or empty and no task can make progress | Deadlocks & Hangs |
| Wrong output (FAIL!) | Logic error in a leaf task | Attach GDB or dump stream contents (above) |
| Build fails with template errors | Pass-by-value/reference mismatch on streams or mmaps | Common Errors |
Always pass your design through software simulation before attempting RTL synthesis or hardware simulation. Software simulation compiles in seconds, and standard tools like GDB and AddressSanitizer work without modification.
To catch memory errors, compile with sanitizers:
tapa g++ -- vadd.cpp vadd-host.cpp -fsanitize=address -g -o vadd
Next step
Full FPGA Compilation
Compile a TAPA design to an FPGA bitstream and run it on hardware.
When to use this
Use this guide after software simulation passes (see Your First Run) and you are ready to target real hardware or run a more accurate RTL-level simulation.
What you need
- TAPA installed — see Installation
- Xilinx Vitis 2022.1 or newer
- A compatible Alveo platform (the examples below use the U250)
- The vadd source files: vadd.cpp and vadd-host.cpp
Stage 1 — Synthesize to RTL
Run tapa compile to translate the C++ kernel into an RTL object (.xo):
tapa \
compile \
--top VecAdd \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-f vadd.cpp \
-o vecadd.xo
| Flag | Meaning |
|---|---|
| --top | Name of the top-level TAPA task |
| --part-num | Target FPGA part number |
| --clock-period | Target clock period in nanoseconds |
| -f | Kernel source file |
| -o | Output XO file |
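As a quick sanity check on --clock-period: a period in nanoseconds converts to a frequency in MHz as 1000 / period, so the 3.33 ns used above targets roughly 300 MHz. A minimal sketch of that arithmetic follows; the helper name ClockPeriodNsToMhz is hypothetical and not part of TAPA.

```cpp
#include <cassert>

// Hypothetical helper (not a TAPA API): convert a clock period in
// nanoseconds to a frequency in MHz using f [MHz] = 1000 / T [ns].
double ClockPeriodNsToMhz(double period_ns) {
  assert(period_ns > 0.0);
  return 1000.0 / period_ns;
}
```

For example, ClockPeriodNsToMhz(4.0) evaluates to 250.0, and a 3.33 ns period lands just above 300 MHz.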
You can replace --part-num and --clock-period with --platform to
target a Vitis platform directly, for example:
--platform xilinx_u250_gen3x16_xdma_4_1_202210_1
HLS reports are written to work.out/report/ after synthesis completes.
Artifact produced: vecadd.xo
Stage 2 — Fast hardware simulation
Before waiting hours for a full bitstream, validate the RTL with TAPA's
fast cosimulation. Pass the .xo file as the --bitstream argument:
./vadd --bitstream=vecadd.xo 1000
Fast cosim uses simplified models for external components (DRAM, AXI
interconnect) so setup takes only a few seconds instead of the ten-plus
minutes that Vitis cosimulation requires. A successful run prints PASS!.
The default simulator backend is xsim, which requires Vivado on Linux. To use
Verilator instead (cross-platform, no Vivado required), pass -cosim_simulator verilator
to the host executable: ./vadd --bitstream=vecadd.xo -cosim_simulator verilator.
Stage 3 — Link to xclbin
Use Vitis v++ to link the .xo into a hardware bitstream. This step does
not involve TAPA and typically takes several hours:
v++ -o vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin \
--link \
--target hw \
--kernel VecAdd \
--platform xilinx_u250_gen3x16_xdma_4_1_202210_1 \
vecadd.xo
Artifact produced: vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin
Hardware binary generation typically takes several hours. Plan accordingly, and ensure your machine will remain available for the full duration.
Stage 4 — On-board execution
With an Alveo card installed and XRT configured, run the host binary and point it at the generated xclbin:
./vadd --bitstream=vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin
A successful on-board run prints PASS!, confirming the accelerator
produced correct results on real hardware.
Next step
The Programming Model
Purpose: Understand the TAPA task-parallel programming model.
Prerequisites: Installation
TAPA bridges familiar sequential C++ to FPGA hardware parallelism. Rather than requiring users to write RTL directly, it lets them express computation as a graph of concurrently-running tasks communicating through typed streams and shared memory interfaces.
Why this exists
Writing FPGA accelerators traditionally requires either low-level RTL or fragile HLS pragmas that break when code is refactored. TAPA solves this by letting you describe the parallel structure of your design as a graph of C++ functions. The compiler turns that graph into RTL automatically, while the same code runs natively on a CPU for simulation. You get the productivity of C++ without giving up the ability to express fine-grained, concurrent hardware pipelines.
Mental model
A TAPA design is a directed graph of tasks connected by streams and memory interfaces. Scalars are passed as function arguments.
Host
│ tapa::invoke(TopTask, bitstream, mmap_args...)
▼
Top-level task ← no computation; spawns all leaf tasks
├── spawns ──> Leaf task A (writes to stream S)
│ stream S
├── spawns ──> Leaf task B (reads stream S, writes to stream T)
│ stream T
└── spawns ──> Leaf task C (reads stream T, writes to mmap)
mmap ──> DRAM
- The host calls tapa::invoke, passing the kernel function, a bitstream path (empty for software simulation), and the kernel arguments.
- The top-level task is the entry point synthesized by tapa compile. It declares streams as local objects, then spawns all leaf tasks and passes streams to them by reference. It contains no computation of its own.
- Leaf tasks perform the actual computation. One leaf writes to a stream; another reads from it. Streams flow between leaf tasks; the top-level task is never the producer or consumer of stream data.
All child tasks spawned by tapa::task().invoke(...) run concurrently. The
top-level task returns only after every child task has finished.
Minimal correct example
Kernel file (vadd.cpp)
The top-level task VecAdd declares three streams, then launches four leaf
tasks that run in parallel:
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
tapa::mmap<float> c, uint64_t n) {
tapa::stream<float> a_q("a");
tapa::stream<float> b_q("b");
tapa::stream<float> c_q("c");
tapa::task()
.invoke(Mmap2Stream, a, n, a_q)
.invoke(Mmap2Stream, b, n, b_q)
.invoke(Add, a_q, b_q, c_q, n)
.invoke(Stream2Mmap, c_q, c, n);
}
Host file (vadd-host.cpp)
The host calls tapa::invoke with the kernel function, the bitstream path, and
the kernel arguments. When the bitstream path is empty (the default), TAPA runs
software simulation:
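#include <cstdint>
#include <cstdlib>
#include <vector>
#include <gflags/gflags.h>
#include <tapa.h>
// Kernel entry point, defined in vadd.cpp.
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n);
DEFINE_string(bitstream, "", "Path to XO or xclbin file. Empty = software simulation.");
int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  // Element count from the first positional argument (for example, ./vadd 1000).
  // The default used here is illustrative.
  const uint64_t n = argc > 1 ? std::atoll(argv[1]) : 1024 * 1024;
  std::vector<float, tapa::aligned_allocator<float>> a(n), b(n), c(n);
  // ... fill a and b ...
  int64_t kernel_time_ns = tapa::invoke(
      VecAdd, FLAGS_bitstream,
      tapa::read_only_mmap<const float>(a),
      tapa::read_only_mmap<const float>(b),
      tapa::write_only_mmap<float>(c),
      n);
  return 0;
}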
The --bitstream flag is what controls which backend runs:
- Omitted or empty → software simulation
- .xo → fast cosimulation
- .hw.xclbin → on-board execution
Rules
- Host code and kernel code must live in separate files. The kernel file is compiled to RTL; the host file is compiled to a CPU executable.
- The kernel file must contain exactly one top-level task: the function passed as --top to tapa compile.
- The top-level task is called via tapa::invoke from the host; it is never called directly.
- An upper-level task body must contain only stream declarations, tapa::task().invoke(...) chains, and scalar/mmap argument forwarding, with no computation.
- Streams are passed by reference (tapa::istream<T>&, tapa::ostream<T>&). Passing streams by value is a compile error.
- mmap arguments are passed by value (tapa::mmap<T>), not by reference.
- Scalar arguments (plain C++ types such as int, float, uint64_t) are passed by value and are read-only to the kernel. The kernel cannot communicate a result back to the host through a scalar parameter; use an mmap or stream instead.
- Software simulation runs automatically when tapa::invoke receives an empty bitstream path.
Common mistakes
Wrong: calling the top-level task directly from host code
// WRONG — bypasses the TAPA runtime entirely; streams are not initialized,
// hardware execution cannot be dispatched.
VecAdd(tapa::mmap<const float>(a.data()), /* ... */);
Right: always use tapa::invoke
// RIGHT — works for software simulation, cosim, and on-board execution.
tapa::invoke(VecAdd, FLAGS_bitstream,
tapa::read_only_mmap<const float>(a),
tapa::read_only_mmap<const float>(b),
tapa::write_only_mmap<float>(c),
n);
tapa::invoke examines the bitstream path at runtime and dispatches to the
correct backend: software simulation (empty path), RTL co-simulation (.xo),
emulation (.hw_emu.xclbin), or on-board execution (.hw.xclbin).
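The dispatch rule above can be pictured as a suffix check on the bitstream path. The sketch below is a plain C++ illustration only: Backend, EndsWith, and ClassifyBitstream are hypothetical names, and TAPA's runtime may well identify the file differently.

```cpp
#include <cassert>
#include <string>

// Illustrative model only; these names are hypothetical and this is not
// TAPA's actual dispatch code.
enum class Backend { kSoftwareSim, kFastCosim, kHwEmulation, kOnBoard, kUnknown };

// Returns true if `s` ends with `suffix`.
bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Maps a bitstream path to the backend named in the text above.
Backend ClassifyBitstream(const std::string& path) {
  if (path.empty()) return Backend::kSoftwareSim;
  if (EndsWith(path, ".xo")) return Backend::kFastCosim;
  if (EndsWith(path, ".hw_emu.xclbin")) return Backend::kHwEmulation;
  if (EndsWith(path, ".hw.xclbin")) return Backend::kOnBoard;
  return Backend::kUnknown;
}
```

Under this model, the host binary never needs to be rebuilt to change backends; only the string passed as --bitstream changes.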
See also
Next step: The Compile Pipeline
The Compile Pipeline
Purpose: Understand the three-stage TAPA compile pipeline.
Prerequisites: The Programming Model
Each tapa subcommand maps to one pipeline stage. Knowing the stages helps
diagnose failures, parallelize synthesis, and use remote execution correctly.
Why this exists
Compiling a TAPA design involves three distinct concerns: parsing C++ and
extracting the task graph, synthesizing each task to RTL with Vitis HLS, and
packaging the RTL into an .xo file for Vitis. Separating these stages lets
you re-run only the parts that changed, run synthesis on a remote machine with
Xilinx tools, and parallelize synthesis across tasks.
Mental model
C++ source
│
▼ tapa analyze (always local)
task graph JSON
│
▼ tapa synth (can run remotely, parallelizable with -j)
per-task RTL (Verilog)
│
▼ tapa pack (can run remotely)
.xo file
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌ (TAPA boundary) ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
│
▼ v++ --link (Vitis, not TAPA)
.xclbin
tapa analyze — Runs tapa-cpp and tapacc locally. Reads your C++
source, resolves task boundaries, and writes a task graph JSON to the work
directory. No vendor tools are required for this step.
tapa synth — Invokes Vitis HLS for each task to produce per-task
Verilog RTL. This is the most time-consuming step. With -j N, up to N tasks
are synthesized in parallel. With --remote-host, synthesis runs on a remote
Linux machine that has Vitis HLS installed.
tapa pack — Combines the per-task RTL into a single Xilinx IP package
(.xo file) suitable for v++ --link.
Shortcut: tapa compile runs all three stages in the correct order in a
single command.
Minimal correct example
All-in-one (most common)
tapa compile \
--top VecAdd \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-f vadd.cpp \
-o vecadd.xo
Use --platform instead of --part-num when targeting a full Vitis platform:
tapa compile \
--top VecAdd \
--platform xilinx_u250_gen3x16_xdma_4_1_202210_1 \
--clock-period 3.33 \
-f vadd.cpp \
-o vecadd.xo
Running stages separately
--work-dir is a top-level tapa flag that applies to all subcommands. It
must be the same across all three stages when running them separately (default:
work.out).
Run tapa analyze first to extract the task graph (no vendor tools needed):
tapa --work-dir work.out analyze \
--top VecAdd \
-f vadd.cpp
Then run tapa synth to synthesize each task to RTL, optionally in parallel
and/or on a remote host:
tapa --work-dir work.out synth \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-j 4
Finally, run tapa pack to produce the .xo file:
tapa --work-dir work.out pack \
-o vecadd.xo
Rules
- tapa analyze always runs locally, even when --remote-host is set.
- tapa synth and tapa pack run on the remote host when --remote-host is provided.
- tapa compile is the shortcut for all three stages and handles stage ordering automatically.
- The -j/--jobs flag on tapa synth controls how many Vitis HLS processes run in parallel. Keep it at or below the available core count on the synthesis machine.
- --work-dir is a top-level flag: tapa --work-dir DIR <subcommand>.
Common mistakes
Wrong: running tapa synth before tapa analyze
# WRONG — the task graph JSON does not exist yet; tapa synth will fail
# with a missing file error.
tapa --work-dir work.out synth --part-num xcu250-figd2104-2L-e --clock-period 3.33
Right: always run tapa analyze first, or use tapa compile
# RIGHT — explicit ordering
tapa --work-dir work.out analyze --top VecAdd -f vadd.cpp
tapa --work-dir work.out synth --part-num xcu250-figd2104-2L-e --clock-period 3.33
tapa --work-dir work.out pack -o vecadd.xo
# RIGHT — shortcut that handles ordering automatically
tapa compile --top VecAdd --part-num xcu250-figd2104-2L-e \
--clock-period 3.33 -f vadd.cpp -o vecadd.xo
Note about v++ link
The v++ --link step that produces .xclbin is performed by Xilinx Vitis,
not TAPA. TAPA's output is the .xo file. See
Build & Run on Board for the full linking
workflow.
See also
Next step: Tasks
Tasks
Purpose: Understand TAPA's three task types and their constraints.
Prerequisites: The Programming Model
Why this exists
TAPA organizes an FPGA accelerator as a hierarchy of C++ functions called tasks. This hierarchy lets the compiler assign each leaf task to an independent HLS module synthesized in parallel, while upper-level tasks provide the wiring between those modules. The result is a design whose parallel structure is explicit in the source code rather than inferred from pragmas.
Mental model
A TAPA design forms a tree of tasks:
Top-level task (entry point, kernel boundary)
├── Upper-level task (orchestration only)
│ ├── Leaf task A (computation)
│ └── Leaf task B (computation)
└── Leaf task C (computation)
Each level has a distinct role:
- Leaf task: performs computation (loops, arithmetic, stream reads/writes). May call ordinary C++ functions. Must NOT invoke other TAPA tasks.
- Upper-level task: orchestrates execution. Its body may only instantiate streams and invoke child tasks with tapa::task().invoke(...). It contains no computation of its own.
- Top-level task: the kernel entry point invoked from the host via tapa::invoke. For the xilinx-vitis target (the default), the top-level task must itself be an upper-level task.
Minimal correct example
The VecAdd function from the vector-add example is a top-level upper-level
task. It instantiates three streams, then invokes four leaf tasks:
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
tapa::mmap<float> c, uint64_t n) {
tapa::stream<float> a_q("a");
tapa::stream<float> b_q("b");
tapa::stream<float> c_q("c");
tapa::task()
.invoke(Mmap2Stream, a, n, a_q)
.invoke(Mmap2Stream, b, n, b_q)
.invoke(Add, a_q, b_q, c_q, n)
.invoke(Stream2Mmap, c_q, c, n);
}
Mmap2Stream, Add, and Stream2Mmap are leaf tasks that each perform a
specific computation. VecAdd contains no computation — only stream
declarations and .invoke(...) calls.
Detached tasks
By default a parent task waits for all child tasks to finish before it terminates. A detached task is instead left running; the parent does not wait for it. This is useful for purely data-driven tasks that have no natural termination point (e.g., a constant data source or an infinite switch network).
tapa::task().invoke<tapa::detach>(LeafTask, arg1, arg2);
Detached tasks are similar to std::thread::detach in the C++ STL. Because
their state does not need to be tracked, they avoid fan-out termination signals
and reduce area.
By default, TAPA tasks are joined: the parent waits for each child to complete.
Use tapa::detach only when the child task genuinely does not need to
terminate on program completion.
Rules
- Leaf tasks receive streams by reference (istream<T>&, ostream<T>&) and mmap interfaces by value (mmap<T>).
- An upper-level task body must contain only stream instantiations and .invoke(...) calls: no loops, arithmetic, or other computation.
- async_mmap channel operations (read_addr, read_data, etc.) are leaf-task-only.
- For the xilinx-vitis target (the default), the top-level task must be an upper-level task; it cannot be a leaf task.
- Leaf templated tasks (template functions that compute directly) are supported. Non-leaf templated tasks that invoke other tasks are not yet supported.
Common mistakes
Wrong — computation inside an upper-level task body:
// Wrong: for loop makes this a leaf task, not an upper-level task
void BadUpper(tapa::mmap<float> mem, uint64_t n) {
tapa::stream<float> q("q");
for (uint64_t i = 0; i < n; ++i) { // <-- computation here
q.write(mem[i]);
}
tapa::task().invoke(Consumer, q, n);
}
Right — move computation into a dedicated leaf task:
void Loader(tapa::mmap<float> mem, uint64_t n, tapa::ostream<float>& q) {
for (uint64_t i = 0; i < n; ++i) {
q.write(mem[i]);
}
}
void GoodUpper(tapa::mmap<float> mem, uint64_t n) {
tapa::stream<float> q("q");
tapa::task()
.invoke(Loader, mem, n, q)
.invoke(Consumer, q, n);
}
See also
Next step: Streams
Streams
Purpose: Communicate between TAPA tasks using typed FIFO streams.
Prerequisites: Tasks
Why this exists
Streams are the primary inter-task communication mechanism in TAPA. They are typed, directional FIFOs that appear explicitly in task signatures, making data flow visible in the source code. Unlike shared memory, streams enforce a single-writer/single-reader discipline and make producer–consumer relationships unambiguous to both the programmer and the compiler.
Mental model
A stream instance lives in an upper-level task. Leaf tasks receive directional references to it:
// Upper-level task instantiates the stream and wires it to two leaf tasks
void Upper(/* ... */) {
tapa::stream<float, 16> data_q("data_q"); // depth = 16 elements
tapa::task()
.invoke(Producer, data_q) // Producer writes to data_q
.invoke(Consumer, data_q); // Consumer reads from data_q
}
// Leaf task signatures use directional references
void Producer(tapa::ostream<float>& out) { /* ... */ }
void Consumer(tapa::istream<float>& in) { /* ... */ }
The stream<T, Depth> template parameter controls the hardware FIFO depth
(default: 2). A larger depth reduces the chance of stalls at the cost of FPGA
BRAM resources.
Blocking read and write
void Task(tapa::istream<int>& in, tapa::ostream<int>& out) {
int data = in.read(); // blocks until data is available
out.write(data); // blocks until space is available
}
The << and >> operator aliases are equivalent:
out << data; // same as out.write(data)
in >> data; // same as data = in.read()
Non-blocking read and write
To read from multiple streams or achieve an initiation interval of one, use the
non-blocking variants that return a bool indicating success:
void Task(tapa::istream<int>& in, tapa::ostream<int>& out) {
int data;
bool ok = in.try_read(data); // returns false if stream is empty
if (ok) {
out.try_write(data); // returns false if stream is full
}
}
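To make the non-blocking semantics concrete without any TAPA dependency, here is a small plain C++ model of a depth-limited FIFO. BoundedFifo and ForwardOne are hypothetical names for illustration; this is not tapa::stream and only mirrors the try_read, try_write, empty, and full behavior described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Plain C++ model of a depth-limited FIFO (NOT tapa::stream).
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t depth) : depth_(depth) { assert(depth > 0); }

  bool empty() const { return q_.empty(); }
  bool full() const { return q_.size() >= depth_; }

  // Returns false and leaves the FIFO unchanged if it is full.
  bool try_write(const T& v) {
    if (full()) return false;
    q_.push_back(v);
    return true;
  }

  // Returns false and leaves `out` unchanged if the FIFO is empty.
  bool try_read(T& out) {
    if (empty()) return false;
    out = q_.front();
    q_.pop_front();
    return true;
  }

 private:
  std::size_t depth_;
  std::deque<T> q_;
};

// Moves at most one element from `in` to `out` without ever dropping data.
// Returns true if an element moved.
template <typename T>
bool ForwardOne(BoundedFifo<T>& in, BoundedFifo<T>& out) {
  if (in.empty() || out.full()) return false;
  T v{};
  in.try_read(v);    // cannot fail: `in` is non-empty
  out.try_write(v);  // cannot fail: `out` has space
  return true;
}
```

Note the design choice in ForwardOne: it checks that the output has space before consuming from the input. If a value is read first and the subsequent try_write returns false, that value is lost unless the task holds on to it and retries.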
Readiness checks
Check stream state before committing to a read or write:
if (!in.empty()) { /* safe to read */ }
if (!out.full()) { /* safe to write */ }
For non-destructive inspection, peek returns the front element and a validity
flag without consuming it:
bool valid;
auto val = in.peek(valid); // does not remove the token
if (valid && /* routing decision */) {
in.read(nullptr); // consume now
}
End-of-Transaction (EoT)
A producer signals the end of a data stream by calling close(). The consumer
detects it with try_eot():
// Producer
void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
tapa::ostream<float>& stream) {
for (uint64_t i = 0; i < n; ++i) {
stream.write(mem[i]);
}
stream.close(); // send EoT token
}
// Consumer
void Stream2Mmap(tapa::istream<float>& stream, tapa::mmap<float> mem) {
for (uint64_t i = 0;;) {
bool eot;
if (stream.try_eot(eot)) {
if (eot) break;
mem[i++] = stream.read(nullptr);
}
}
}
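One way to think about EoT is that every FIFO slot carries a payload plus an end-of-transaction flag. The sketch below models that idea with std::queue in plain C++; Token, Produce, and Consume are hypothetical names, and this is a conceptual model rather than TAPA's implementation.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Illustrative model only: a payload plus an end-of-transaction marker.
struct Token {
  float value;
  bool eot;
};

// Producer side: push every value, then one EoT marker
// (analogous to a series of write() calls followed by close()).
void Produce(const std::vector<float>& src, std::queue<Token>& q) {
  for (float v : src) q.push(Token{v, false});
  q.push(Token{0.0f, true});
}

// Consumer side: drain tokens until the EoT marker is seen.
std::vector<float> Consume(std::queue<Token>& q) {
  std::vector<float> dst;
  while (!q.empty()) {
    Token t = q.front();
    q.pop();
    if (t.eot) break;  // end of this transaction
    dst.push_back(t.value);
  }
  return dst;
}
```

Because the end of the data is signaled in-band, the consumer in this model does not need to know the element count in advance.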
EoT loop helper macros
TAPA provides macros that encapsulate the non-blocking EoT check pattern:
- TAPA_WHILE_NOT_EOT(stream): loops until stream delivers an EoT token; body executes only when a valid non-EoT token is available.
- TAPA_WHILE_NEITHER_EOT(s1, s2): loops until either stream delivers EoT; body executes only when both have a valid token.
- TAPA_WHILE_NONE_EOT(s1, s2, s3): three-stream variant.
void Consumer(tapa::istream<int>& in, tapa::ostream<int>& out) {
TAPA_WHILE_NOT_EOT(in) {
out.write(in.read(nullptr));
}
out.close();
}
A downstream task can reopen a closed stream with stream.open() to reuse it
across multiple transactions.
Stream arrays
For parameterized designs, TAPA provides arrays of streams:
- tapa::streams<T, N>: array of N streams (instantiation in upper-level task)
- tapa::istreams<T, N>& / tapa::ostreams<T, N>&: directional array references in leaf task signatures
When invoking N parallel instances of a leaf task, use invoke<tag, N>(...)
and TAPA distributes the array elements automatically:
void InnerStage(int b, tapa::istreams<pkt_t, kN / 2>& in_q0,
tapa::istreams<pkt_t, kN / 2>& in_q1,
tapa::ostreams<pkt_t, kN>& out_q) {
tapa::task().invoke<tapa::detach, kN / 2>(Switch2x2, b, in_q0, in_q1, out_q);
}
Rules
- Always pass streams by reference: istream<T>&, ostream<T>&. Never by value, because the stream object is not copyable.
- Each stream instance must have exactly one reader and exactly one writer.
- TAPA software simulation respects stream depth: a full stream blocks the writer, matching hardware behavior.
- Stream depth is a hardware FIFO size. The FPGA resource used depends on depth:
- Depth < 128: synthesised from SRL shift-registers (no BRAM cost).
- Depth ≥ 128: mapped to BRAM.
- Depth ≥ 4096 and element width ≥ 36 bits: mapped to URAM.
- Default depth is 2, which costs only SRL resources.
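The depth thresholds listed above can be written down as a small decision function. This is a sketch that encodes only the rules stated in this section; FifoResource and ClassifyFifo are hypothetical names, and the actual tool flow may take other factors into account.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; encodes only the thresholds listed in the rules above.
enum class FifoResource { kSrl, kBram, kUram };

FifoResource ClassifyFifo(uint32_t depth, uint32_t width_bits) {
  assert(depth > 0 && width_bits > 0);
  // The most specific rule is checked first: deep and wide FIFOs go to URAM.
  if (depth >= 4096 && width_bits >= 36) return FifoResource::kUram;
  if (depth >= 128) return FifoResource::kBram;
  return FifoResource::kSrl;
}
```

Under these rules, the default tapa::stream<float> (depth 2, 32 bits) classifies as SRL, while the same element type at depth 4096 classifies as BRAM because it is narrower than 36 bits.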
Common mistakes
Wrong — stream passed by value (drops the reference, triggers a copy):
void Leaf(tapa::istream<float> in) { /* ... */ } // missing &
Right — stream passed by reference:
void Leaf(tapa::istream<float>& in) { /* ... */ }
See also
Next step: Memory Access: mmap
Memory Access: mmap
Purpose: Access FPGA-adjacent DRAM from TAPA leaf tasks using mmap.
Prerequisites: Tasks
Why this exists
FPGA designs need to read from and write to off-chip DRAM. tapa::mmap<T>
provides an array-like interface that TAPA compiles to AXI4 memory-mapped
transactions. It is simpler to use than async_mmap and is the right choice
when latency hiding is not required or when access patterns are straightforward.
Mental model
A leaf task receives mmap<T> by value and accesses it like a C array:
void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
tapa::ostream<float>& stream) {
for (uint64_t i = 0; i < n; ++i) {
stream << mem[i]; // array subscript operator
}
}
The upper-level task passes the mmap argument through to the leaf:
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
tapa::mmap<float> c, uint64_t n) {
tapa::stream<float> a_q("a");
tapa::stream<float> b_q("b");
tapa::stream<float> c_q("c");
tapa::task()
.invoke(Mmap2Stream, a, n, a_q)
.invoke(Mmap2Stream, b, n, b_q)
.invoke(Add, a_q, b_q, c_q, n)
.invoke(Stream2Mmap, c_q, c, n);
}
Minimal correct example
Mmap2Stream from the vector-add example reads from a read-only mmap and
writes the values into a stream:
void Mmap2Stream(tapa::mmap<const float> mmap, uint64_t n,
tapa::ostream<float>& stream) {
for (uint64_t i = 0; i < n; ++i) {
stream << mmap[i];
}
}
Note that mmap is passed by value (no &).
Host-side wrappers
On the host, the direction of host-to-kernel data transfer is declared in the
tapa::invoke call using wrapper types:
- tapa::read_only_mmap<T>(vec): host sends data to the kernel; kernel reads
- tapa::write_only_mmap<T>(vec): kernel writes; host receives data back
- tapa::read_write_mmap<T>(vec): bidirectional transfer
read_only_mmap and write_only_mmap describe the host-to-kernel transfer
direction, not the kernel's internal access pattern. The kernel task always
receives a plain mmap<T> parameter regardless of which wrapper was used.
From the vector-add host code:
tapa::invoke(
VecAdd, FLAGS_bitstream,
tapa::read_only_mmap<const float>(a),
tapa::read_only_mmap<const float>(b),
tapa::write_only_mmap<float>(c),
n);
Aligned allocator
If the host std::vector is not page-aligned, the TAPA runtime must make an
extra copy when transferring data to the FPGA. Use tapa::aligned_allocator<T>
to avoid this:
std::vector<float, tapa::aligned_allocator<float>> a(n);
std::vector<float, tapa::aligned_allocator<float>> b(n);
std::vector<float, tapa::aligned_allocator<float>> c(n);
This eliminates the extra copy and suppresses XRT alignment warnings.
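For intuition about what an aligned allocator does, here is a minimal stand-in written against POSIX posix_memalign. It is an illustration under stated assumptions (a POSIX system and a 4096-byte page size), not the implementation of tapa::aligned_allocator; PageAlignedAllocator and IsPageAligned are hypothetical names. Real host code should keep using tapa::aligned_allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Assumed page size for this sketch; real systems may differ.
constexpr std::size_t kPageSize = 4096;

// Hypothetical stand-in for illustration only (POSIX systems).
template <typename T>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

  // Allocates storage for n objects, starting on a page boundary.
  T* allocate(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, kPageSize, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(p);
  }

  void deallocate(T* p, std::size_t) { std::free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return false;
}

// Returns true if `p` lies on a page boundary.
bool IsPageAligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}
```

A vector declared as std::vector<float, PageAlignedAllocator<float>> then has a data() pointer on a page boundary, which is the property the runtime needs to skip the extra copy.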
Shared mmap
The same mmap argument can be passed to multiple child tasks. TAPA inserts
an AXI interconnect so both tasks share the same AXI port:
void Load(tapa::mmap<float> srcs, uint64_t n,
tapa::ostream<float>& a, tapa::ostream<float>& b) {
tapa::task()
.invoke(Mmap2Stream, srcs, 0, n, a)
.invoke(Mmap2Stream, srcs, 1, n, b);
}
When an mmap is shared across tasks, the programmer is responsible for memory consistency. Concurrent accesses to the same addresses will produce undefined results.
mmap arrays
For parameterized designs with multiple independent memory ports:
- tapa::mmaps<T, N>: array of N mmap interfaces (kernel side)
- tapa::read_only_mmaps<T, N> / tapa::write_only_mmaps<T, N> / tapa::read_write_mmaps<T, N>: directional wrappers for tapa::invoke on the host side
// Host side
tapa::invoke(VecAdd, FLAGS_bitstream,
tapa::read_only_mmaps<float, M>(a),
tapa::read_only_mmaps<float, M>(b),
tapa::write_only_mmaps<float, M>(c), n);
// Kernel side
void VecAdd(tapa::mmaps<float, M> a, tapa::mmaps<float, M> b,
tapa::mmaps<float, M> c, uint64_t n) { /* ... */ }
Rules
- Kernel task signatures: mmap<T> must be passed by value (no &). This is the opposite of streams.
- mmap can only be used as a function parameter, not as a local variable.
- read_only_mmap / write_only_mmap describe host-to-kernel transfer direction only; they do not constrain kernel access patterns.
Common mistakes
Wrong — mmap passed by reference:
void Kernel(tapa::mmap<float>& mem) { /* ... */ } // & is wrong
Right — mmap passed by value:
void Kernel(tapa::mmap<float> mem) { /* ... */ }
See also
Next step: Memory Access: async_mmap
Memory Access: async_mmap
Purpose: Use async_mmap to overlap DRAM access latency with computation.
Prerequisites: Memory Access: mmap
Why this exists
mmap does not provide explicit control over outstanding DRAM transactions.
The HLS tool may issue burst transactions for sequential access, but for
random-access patterns or designs that need fine-grained control over
outstanding requests, the lack of explicit flow control limits throughput.
Off-chip DRAM latency is typically 100–200 ns, and without the ability to
overlap request issuance with data receipt, achievable bandwidth stays far
below the channel peak.
async_mmap exposes the five AXI channels as individual streams, letting you
issue multiple outstanding requests and overlap address issuance with data
receipt. The result is much higher DRAM throughput — especially for random
access — and significantly lower area overhead compared to the Vitis HLS
m_axi interface.
Mental model: five AXI channels
async_mmap<T> is a struct whose fields are streams corresponding to the five
AXI channels:
template <typename T>
struct async_mmap {
using addr_t = int64_t;
using resp_t = uint8_t;
tapa::ostream<addr_t> read_addr; // issue read addresses
tapa::istream<T> read_data; // receive read data
tapa::ostream<addr_t> write_addr; // issue write addresses
tapa::ostream<T> write_data; // send write data
tapa::istream<resp_t> write_resp; // receive write acknowledgments
};
The key insight is that read_addr and read_data are decoupled: you can
issue many addresses into read_addr before any data arrives on read_data,
hiding latency by keeping multiple requests in flight simultaneously.
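A back-of-the-envelope model shows why keeping requests in flight matters. The sketch below is an idealized assumption, not a TAPA API: latency is a fixed number of cycles, at most one address is issued per cycle, and `SerializedCycles` and `OverlappedCycles` are hypothetical names. At the 300 MHz clock used elsewhere in this guide, 100–200 ns corresponds to roughly 30–60 cycles.

```cpp
#include <cstdint>

// Cycles to read n elements when each request waits for its response
// before the next address is issued (one request in flight at a time).
uint64_t SerializedCycles(uint64_t n, uint64_t latency) {
  return n * (latency + 1);
}

// Cycles when one address is issued every cycle and each response
// returns `latency` cycles after its request (fully overlapped).
uint64_t OverlappedCycles(uint64_t n, uint64_t latency) {
  return n == 0 ? 0 : n + latency;
}
```

For 1024 elements and a 45-cycle latency, the serialized model needs 47104 cycles while the overlapped model needs 1069, so under these assumptions the latency is paid once rather than once per element.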
Minimal correct example
The pattern for overlapping read requests and responses in a single pipelined loop:
void ReadKernel(tapa::async_mmap<float>& mem, float* result,
uint64_t n) {
for (uint64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
// Issue a read address if the channel has space
if (i_req < n && mem.read_addr.try_write(i_req)) {
++i_req;
}
// Consume a read response if data is available
if (!mem.read_data.empty()) {
result[i_resp] = mem.read_data.read(nullptr);
++i_resp;
}
}
}
Two loop counters (i_req, i_resp) track outstanding requests. Because both
checks are non-blocking, the loop can issue a new address and receive a
response in the same clock cycle.
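The two-counter loop can be exercised on the host with a small software stand-in for the read channels. This is a sketch under stated assumptions: `ReadChannelModel` and `ReadAll` are hypothetical, the memory has a fixed latency measured in loop iterations, and `depth` caps the number of outstanding requests. It does not model TAPA's generated hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of the read side of an async_mmap: addresses go in,
// data comes out `latency` iterations later, in request order.
class ReadChannelModel {
 public:
  ReadChannelModel(std::vector<float> mem, uint64_t latency, std::size_t depth)
      : mem_(std::move(mem)), latency_(latency), depth_(depth) {}

  // Non-blocking: fails when `depth` requests are already in flight.
  bool TryWriteAddr(uint64_t addr, uint64_t now) {
    assert(addr < mem_.size());
    if (in_flight_.size() >= depth_) return false;
    in_flight_.push_back(Request{addr, now + latency_});
    return true;
  }

  // Non-blocking: succeeds only once the oldest request has matured.
  bool TryReadData(float& out, uint64_t now) {
    if (in_flight_.empty() || in_flight_.front().ready_at > now) return false;
    out = mem_[in_flight_.front().addr];
    in_flight_.pop_front();
    return true;
  }

 private:
  struct Request {
    uint64_t addr;
    uint64_t ready_at;
  };
  std::vector<float> mem_;
  uint64_t latency_;
  std::size_t depth_;
  std::deque<Request> in_flight_;
};

// Same two-counter structure as ReadKernel; returns the iteration count.
uint64_t ReadAll(ReadChannelModel& ch, std::vector<float>& result) {
  const uint64_t n = result.size();
  uint64_t cycles = 0;
  for (uint64_t i_req = 0, i_resp = 0; i_resp < n; ++cycles) {
    if (i_req < n && ch.TryWriteAddr(i_req, cycles)) ++i_req;
    float value = 0.0f;
    if (ch.TryReadData(value, cycles)) result[i_resp++] = value;
  }
  return cycles;
}
```

With 8 elements and a latency of 4, this model finishes in 12 iterations when 16 requests may be outstanding and in 40 iterations when only one may be, which is the gap the two counters are there to close.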
Runtime burst detection
TAPA coalesces sequential addresses into AXI bursts automatically at runtime. You only need to issue individual element-by-element addresses; TAPA's generated hardware merges adjacent requests into larger burst transactions dynamically. This provides burst efficiency for sequential patterns without requiring static analysis or explicit burst programming in your kernel code.
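To make the idea of coalescing concrete, here is a software sketch that merges consecutive element addresses into bursts. It is an illustration only: `Burst` and `CoalesceAddresses` are hypothetical names, and TAPA's generated hardware is not described by this code (real logic also has to respect timing windows and AXI constraints that are ignored here). The `max_len` cap stands in for a protocol limit such as the 256-beat maximum of an AXI4 incrementing burst.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Burst {
  uint64_t start;   // first element address of the burst
  uint64_t length;  // number of consecutive elements
};

// Merges runs of consecutive addresses into bursts of at most max_len.
std::vector<Burst> CoalesceAddresses(const std::vector<uint64_t>& addrs,
                                     uint64_t max_len) {
  assert(max_len > 0);
  std::vector<Burst> bursts;
  for (uint64_t addr : addrs) {
    if (!bursts.empty()) {
      Burst& last = bursts.back();
      // Extend the current burst when the address is the next element.
      if (addr == last.start + last.length && last.length < max_len) {
        ++last.length;
        continue;
      }
    }
    bursts.push_back(Burst{addr, 1});  // start a new burst
  }
  return bursts;
}
```

For the address sequence 0, 1, 2, 3, 10, 11, 20 with `max_len` of 256, the sketch yields three bursts: four elements at 0, two at 10, and one at 20.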
Area comparison
async_mmap uses significantly fewer FPGA resources than the Vitis HLS
m_axi interface, which is important for HBM devices that expose many memory
channels:
| Memory Interface | Clock (MHz) | LUT | FF | BRAM | URAM | DSP |
|---|---|---|---|---|---|---|
| #pragma HLS interface m_axi | 300 | 1189 | 3740 | 15 | 0 | 0 |
| async_mmap | 300 | 1466 | 162 | 0 | 0 | 0 |
async_mmap uses no BRAM and drastically fewer flip-flops, at the cost of
slightly more LUTs for the burst-detection logic.
Rules
- async_mmap<T> must be passed by reference (async_mmap<T>&). Passing by value is deprecated.
- Channel operations (try_read/try_write on the five streams) are leaf-task only. An upper-level task may accept and forward an async_mmap<T>& parameter to a child leaf task without operating on it.
- An mmap<T> argument can be passed to an async_mmap<T>& parameter; the mmap is automatically promoted.
- Only non-blocking operations (try_read, try_write) should be used on async_mmap channels inside pipelined loops.
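Never use blocking read/write on async_mmap channels inside a pipelined
loop. A blocking operation stalls the whole loop, so no other channel can make
progress and the kernel can deadlock. Use try_read and try_write, or guard the
access with an empty()/full() check as the minimal example does.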
Common mistakes
Wrong — async_mmap passed by value (deprecated):
void Kernel(tapa::async_mmap<float> mem) { /* ... */ } // missing &
Right — async_mmap passed by reference:
void Kernel(tapa::async_mmap<float>& mem) { /* ... */ }
Wrong — blocking read inside a pipelined loop:
// Wrong: blocks until data arrives, preventing address issuance
float val = mem.read_data.read();
Right — non-blocking read with availability check:
float val;
if (mem.read_data.try_read(val)) {
// process val
}
See also
Next step: Software Simulation
Software Simulation
Purpose: Run software simulation to verify your TAPA design's logic without FPGA hardware.
When to use this: Before synthesizing — software simulation is fast (seconds) and requires only a C++ compiler and the TAPA library.
What you need
- A compiled TAPA host executable (produced by tapa g++)
- No FPGA, no Vivado, no XRT required
Commands
Run the executable with no --bitstream argument. TAPA detects the missing argument and runs the software simulation:
./vadd
For reproducible output when debugging ordering-sensitive behavior, pin the simulation to a single thread:
TAPA_CONCURRENCY=1 ./vadd
TAPA_CONCURRENCY defaults to the physical CPU core count. Set it to 1 for reproducible task scheduling at the cost of simulation speed.
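The lookup-with-default behavior can be pictured as follows. This is a hypothetical helper, not TAPA's implementation: `ResolveConcurrency` is an invented name, and falling back to the default for empty, non-numeric, or non-positive values is an assumption, since the text above only states the default.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: turn the raw environment string into a worker count,
// falling back to `default_cores` when the value is unset or unusable.
unsigned ResolveConcurrency(const char* env_value, unsigned default_cores) {
  assert(default_cores > 0);
  if (env_value == nullptr || *env_value == '\0') return default_cores;
  char* end = nullptr;
  const long parsed = std::strtol(env_value, &end, 10);
  // Reject trailing garbage and non-positive counts (assumed policy).
  if (*end != '\0' || parsed <= 0) return default_cores;
  return static_cast<unsigned>(parsed);
}
```

A caller would pass `std::getenv("TAPA_CONCURRENCY")` as the first argument and its own core count as the second.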
Expected output
I20000101 00:00:00.000000 0000000 task.h:66] running software simulation with TAPA library
kernel time: 1.19429 s
PASS!
The log line confirms the software simulation path was taken. PASS! is printed by the application when its correctness check succeeds.
Stream logging
To capture the values flowing through every tapa::stream channel, set TAPA_STREAM_LOG_DIR before running:
TAPA_STREAM_LOG_DIR=/tmp/logs ./vadd
TAPA writes one log file per stream. The format depends on the element type:
- Primitive types (int, float, …) are logged as human-readable text, one value per line. For example, writing 42 to a tapa::stream<int> produces 42\n.
- Non-primitive types without operator<< are logged in hex with little-endian byte order. For example, writing Foo{0x4222} to a tapa::stream<Foo> produces 0x22420000\n.
- Non-primitive types with operator<< defined are logged using that operator, producing human-readable text.
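The hex form in the second case can be reproduced by hand. The sketch below assumes that `Foo` holds a single 32-bit field and that digits are lowercase; `HexLittleEndian` is a hypothetical helper written for this illustration, not the logger TAPA uses.

```cpp
#include <cstdint>
#include <string>

// Formats a 32-bit value as "0x" followed by its four bytes in
// little-endian order (least significant byte first), two digits each.
std::string HexLittleEndian(uint32_t value) {
  static const char kDigits[] = "0123456789abcdef";
  std::string out = "0x";
  for (int byte = 0; byte < 4; ++byte) {
    const uint32_t b = (value >> (8 * byte)) & 0xffu;
    out.push_back(kDigits[b >> 4]);   // high nibble
    out.push_back(kDigits[b & 0xf]);  // low nibble
  }
  return out;
}
```

`HexLittleEndian(0x4222)` returns `0x22420000`: the low byte `22` comes first, then `42`, then two zero bytes.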
Why coroutine simulation is more accurate than Vitis HLS simulation
Vitis HLS software simulation runs each task sequentially in a single thread. The tasks take turns executing to completion before the next one starts. This means races between concurrent tasks are invisible — the simulation passes even when tasks make assumptions about each other's execution order that will not hold in real hardware.
TAPA uses coroutine-based simulation: tasks run as coroutines on the simulator's worker threads (a single thread when TAPA_CONCURRENCY=1) and yield cooperatively at stream blocking points. When a task calls read() on an empty stream, it suspends and another task runs. This models the concurrent, backpressure-driven semantics of hardware much more faithfully. Bugs that manifest in hardware because two tasks execute simultaneously are far more likely to surface during TAPA software simulation than during Vitis HLS software simulation.
This is also why TAPA enforces stream depth in software simulation: a producer that fills a depth-2 FIFO will block in TAPA simulation, just as it would in hardware.
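The effect of an enforced depth is easy to see with a minimal bounded queue. `BoundedFifo` below is a hypothetical stand-in for a depth-limited stream, written in plain C++; it is not TAPA's stream implementation and only models the full and empty conditions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Minimal bounded FIFO with non-blocking operations.
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t depth) : depth_(depth) {
    assert(depth > 0);
  }

  // Fails when the FIFO already holds `depth` elements.
  bool TryWrite(const T& value) {
    if (buf_.size() >= depth_) return false;  // full: producer must wait
    buf_.push_back(value);
    return true;
  }

  // Fails when the FIFO holds no elements.
  bool TryRead(T& value) {
    if (buf_.empty()) return false;  // empty: consumer must wait
    value = buf_.front();
    buf_.pop_front();
    return true;
  }

 private:
  std::size_t depth_;
  std::deque<T> buf_;
};
```

With a depth of 2, the third consecutive `TryWrite` fails until a `TryRead` frees a slot. A simulator that ignored depth would accept that third write and hide a producer stall that hardware would exhibit.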
Debugging with GDB
Software simulation runs as ordinary host code, so GDB works as normal:
gdb ./vadd
Then set a breakpoint on any TAPA task function by name:
(gdb) b VecAdd
(gdb) run
Breakpoints, watchpoints, and backtraces all work because every task runs as a coroutine on the host CPU.
Validation
Simulation is correct when:
- The program exits with code 0.
- The application's own correctness check prints PASS! (or your application's equivalent).
- No deadlock or hang occurs within the expected runtime.
If something goes wrong
If the simulation hangs indefinitely, a stream deadlock is likely. See Deadlocks & Hangs for diagnosis steps.
For unexpected errors or assertion failures, see Common Errors.
Next step: Fast Hardware Simulation
Fast Hardware Simulation
Purpose: Validate RTL correctness faster than Vitis cosimulation using TAPA's fast cosim.
When to use this: After tapa compile produces a .xo file, before the multi-hour v++ --link step. Fast cosim catches logic bugs in generated RTL in seconds rather than the ten-plus minutes Vitis cosimulation requires.
What you need
- A .xo kernel object from tapa compile (or a .zip for the xilinx-hls target)
- One of:
  - xsim: Requires a Vivado installation. Linux only.
  - verilator: Open-source. Works on Linux and macOS. No Vivado required.
Commands
Basic run
Pass the .xo file as the --bitstream argument:
./vadd --bitstream VecAdd.xo 1000
For the xilinx-hls target, a .zip file also works:
./vadd --bitstream VecAdd.zip 1000
Choosing a simulator backend
The default backend is xsim. To switch to Verilator:
./vadd --bitstream VecAdd.xo -cosim_simulator verilator 1000
Saving waveforms
Specify a persistent work directory and enable waveform saving:
./vadd --bitstream VecAdd.xo \
-cosim_work_dir ./cosim_work \
-xsim_save_waveform \
1000
Strongly recommended: pair -xsim_save_waveform with -cosim_work_dir. Without a persistent work directory, fast cosim uses a temporary directory that is deleted at exit, removing any saved waveforms with it.
Setup-only and resume workflow
When you want to inspect the generated simulation environment before committing to a full run:
# Step 1: set up the simulation environment and stop before running
./vadd --bitstream VecAdd.xo \
-cosim_work_dir ./cosim_work \
-cosim_setup_only \
1000
# Step 2: after inspecting, run post-simulation checks without re-simulating
./vadd --bitstream VecAdd.xo \
-cosim_work_dir ./cosim_work \
-cosim_resume_from_post_sim \
1000
Parallel runs
When a host application calls tapa::invoke more than once — for example, a pipeline split into separate kernels each compiled to its own .xo file — TAPA launches all cosim instances concurrently. Each kernel is compiled independently and its .xo path is passed to its own tapa::invoke call via a separate bitstream flag:
// Host code: two separate kernels, each with its own bitstream flag
DEFINE_string(producer_bitstream, "", "XO for Producer kernel");
DEFINE_string(consumer_bitstream, "", "XO for Consumer kernel");
tapa::invoke(Producer, FLAGS_producer_bitstream, ...);
tapa::invoke(Consumer, FLAGS_consumer_bitstream, ...);
./app --producer_bitstream=producer.xo --consumer_bitstream=consumer.xo
If all instances share the same -cosim_work_dir, their simulation environments collide. Pass -cosim_work_dir_parallel to give each instance its own uniquely named subdirectory:
./app \
--producer_bitstream=producer.xo \
--consumer_bitstream=consumer.xo \
-cosim_work_dir ./cosim_work \
-cosim_work_dir_parallel
TAPA creates ./cosim_work/XXXXXX/ (a unique name per instance) so that the simulations run without interfering with each other's build artifacts.
Runtime flags reference
The following flags control fast cosim behavior when passed to the host executable. The canonical reference is Runtime Flags.
| Flag | Description |
|---|---|
| -cosim_executable <path> | Deprecated. Fast cosim now runs in-process via libfrt; this flag is ignored. |
| -xsim_part_num <part> | Target FPGA part number for simulation (e.g., xcu280-fsvh2892-2L-e). |
| -cosim_work_dir <dir> | Persistent working directory for simulation artifacts. Without this, a temporary directory is used and deleted after the run. |
| -xsim_save_waveform | Save simulation waveforms to a .wdb file in the work directory. Pair with -cosim_work_dir so the waveform is not deleted along with the temporary directory. |
| -xsim_start_gui | Open the Vivado GUI for interactive debugging during simulation. |
| -cosim_simulator <backend> | Simulator backend: xsim (default, Linux only) or verilator (cross-platform). |
| -cosim_setup_only | Run simulation setup only, then stop before executing the simulation. |
| -cosim_resume_from_post_sim | Skip re-running the simulation; jump directly to post-simulation checks. |
| -cosim_work_dir_parallel | Create a unique subdirectory per instance when running concurrent simulations. |
Expected output
Fast cosim completes in seconds for simple designs. A successful run prints the application's correctness result (e.g., PASS!) after the simulation finishes.
Debugging frozen simulations
If the simulation becomes unresponsive:
- Run with -cosim_work_dir to persist intermediate files.
- Abort the simulation with Ctrl-C.
- Locate [work-dir]/output/run/run_cosim.tcl.
- Open Vivado in GUI mode and source the script:
vivado -mode gui -source [work-dir]/output/run/run_cosim.tcl
This allows real-time observation and waveform analysis of the frozen state.
Cross-channel access for HBM is not currently supported in fast cosimulation. Each AXI interface can only access one HBM channel.
If something goes wrong
See Cosimulation Issues for diagnosis steps covering xsim hangs, Verilator build errors, and waveform debugging.
Next step: Vitis Cosimulation
Parallel RTL Emulation
Purpose: Run cycle-accurate RTL simulation for each kernel module concurrently, reducing total cosim time while preserving cycle-accurate behavior where it matters.
RTL cosimulation gives you cycle-accurate behavior for the logic inside each kernel — pipeline depths, stall conditions, II violations, and hazards that software simulation cannot catch. It does not give you cycle-accurate behavior between kernels: the FIFOs connecting separate cosim processes are shared-memory queues, and memory (mmap/async_mmap) latency is similarly abstracted. Parallel RTL emulation is therefore most valuable for validating the cycle-sensitive internals of individual kernels, not end-to-end timing across the full datapath.
Running one cosim process per kernel and launching them concurrently reduces wall-clock time compared to simulating everything in a single process or sequentially.
Concept
In a standard TAPA design, one top-level function is compiled into one .xo and the entire design is simulated as a single cosim process. In the parallel emulation pattern:
- Each kernel function is compiled to its own .xo with tapa compile --top <KernelFunc>.
- The host application defines a separate bitstream flag per kernel and passes each to .invoke() wrapped in tapa::executable.
- tapa::task launches all kernel simulations concurrently; streams between kernels communicate through shared memory files managed by the runtime.
┌────────────────────────────────────────────────────────┐
│ Host application │
│ │
│ tapa::task() │
│ .invoke(KernelA, tapa::executable(FLAGS_a_bs), ...) │──▶ cosim process A
│ .invoke(KernelB, tapa::executable(FLAGS_b_bs), ...) │──▶ cosim process B
│ .invoke(KernelC, tapa::executable(FLAGS_c_bs), ...) │──▶ cosim process C
└────────────────────────────────────────────────────────┘
streams between kernels → shared-memory FIFOs (not cycle-accurate)
API
tapa::executable
Wraps a path to a kernel .xo (or .zip for the xilinx-hls target). When passed as the second argument to .invoke(), the runtime launches RTL emulation for that invocation instead of running it in software simulation.
class executable {
public:
explicit executable(std::string path);
// Not copyable or movable.
};
If the path is empty, .invoke() falls back to software simulation for that kernel. This lets a single binary select simulation or emulation per-kernel at runtime.
tapa::task::invoke with tapa::executable
// Kernel-specific override: run KernelFunc from the given XO file.
task& invoke(Func&& func, tapa::executable exe, Args&&... args);
All .invoke() calls in a tapa::task() chain start concurrently. Kernels that receive a tapa::executable each get their own cosim process; kernels without one run as software coroutines.
tapa::executable must be provided before any argument that is a direct stream reader or writer. The runtime uses the executable path to bind the right simulation backend before it can connect streams.
Compiling Each Kernel
Each kernel function is compiled independently. Invoke tapa compile once per top function, passing its name via --top:
tapa compile \
--top Scatter \
--part-num xcu280-fsvh2892-2L-e \
--clock-period 3.33 \
-f cannon.cpp \
-o scatter.xo
tapa compile \
--top ProcElem \
--part-num xcu280-fsvh2892-2L-e \
--clock-period 3.33 \
-f cannon.cpp \
-o proc-elem.xo
tapa compile \
--top Gather \
--part-num xcu280-fsvh2892-2L-e \
--clock-period 3.33 \
-f cannon.cpp \
-o gather.xo
All three compilations can share the same source file. Each produces an independent .xo that knows only its own top function's interface.
Host Code
The host application follows the standard TAPA pattern, but uses one DEFINE_string per kernel rather than a single --bitstream flag:
#include <gflags/gflags.h>
#include <tapa.h>
DEFINE_string(scatter_bitstream, "",
"path to Scatter XO; empty = software simulation");
DEFINE_string(proc_elem_bitstream, "",
"path to ProcElem XO; empty = software simulation");
DEFINE_string(gather_bitstream, "",
"path to Gather XO; empty = software simulation");
int main(int argc, char* argv[]) {
gflags::ParseCommandLineFlags(&argc, &argv, true);
// ... allocate buffers ...
tapa::invoke(TopFunction, /*bitstream=*/"",
tapa::read_only_mmap<const float>(a),
tapa::read_only_mmap<const float>(b),
tapa::write_only_mmap<float>(c), n);
}
The TopFunction assembles the task graph. Each .invoke() receives its own tapa::executable:
void TopFunction(tapa::mmap<const float> a_vec,
tapa::mmap<const float> b_vec,
tapa::mmap<float> c_vec, uint64_t n) {
tapa::streams<float, 4> a("a");
tapa::streams<float, 4> b("b");
tapa::streams<float, 4> c("c");
// ... declare inter-kernel streams ...
tapa::task()
.invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), a_vec, a)
.invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), b_vec, b)
.invoke(ProcElem, tapa::executable(FLAGS_proc_elem_bitstream), a, b, c, ...)
// ... more ProcElem instances ...
.invoke(Gather, tapa::executable(FLAGS_gather_bitstream), c_vec, c);
}
Streams declared inside TopFunction are host-side objects. The runtime passes references to the same shared-memory FIFO to each cosim process that reads or writes it, so data flows between kernels in the same order as it would on hardware, although the timing of those transfers is not cycle-accurate.
Running
Pass the compiled .xo files to the host binary:
./cannon \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo
When any flag is empty the corresponding kernel runs in software simulation. This lets you emulate a subset of the design while the rest runs in simulation:
# Only emulate ProcElem; Scatter and Gather run in software simulation.
./cannon --proc_elem_bitstream=proc-elem.xo
Work directory
By default each cosim process writes to a temporary directory that is deleted at exit. Provide -cosim_work_dir to retain artifacts. When multiple kernels share the same work directory their simulation environments collide; use -cosim_work_dir_parallel to give each process a unique subdirectory:
./cannon \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo \
-cosim_work_dir ./cosim_work \
-cosim_work_dir_parallel
TAPA creates ./cosim_work/XXXXXX/ (a unique name per instance) so the simulations do not interfere with each other.
Simulator backend
The same -cosim_simulator flag applies to all instances:
./cannon \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo \
-cosim_simulator verilator
Controlling concurrency
Set TAPA_CONCURRENCY to limit how many cosim processes run simultaneously. This is useful on machines with limited memory:
TAPA_CONCURRENCY=1 ./cannon \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo
At TAPA_CONCURRENCY=1 the processes still exchange data correctly through shared-memory FIFOs, but only one simulation runs at a time.
Runtime flags reference
| Flag | Description |
|---|---|
| -cosim_work_dir <dir> | Persistent working directory for simulation artifacts. |
| -cosim_work_dir_parallel | Create a unique subdirectory per instance. Required when multiple kernels share -cosim_work_dir. |
| -cosim_simulator <backend> | xsim (default, Linux only) or verilator (cross-platform). Applied to all instances. |
| -xsim_save_waveform | Save simulation waveforms. Pair with -cosim_work_dir. |
| -cosim_executable <path> | Deprecated. Fast cosim now runs in-process via libfrt; this flag is ignored. |
| -xsim_part_num <part> | Target FPGA part number (e.g., xcu280-fsvh2892-2L-e). |
| TAPA_CONCURRENCY | Environment variable. Limits the number of cosim processes that run simultaneously. |
Full example: Cannon matrix multiply
The tests/functional/parallel-emulation/ directory in the TAPA repository contains a working parallel-emulation example. The Cannon algorithm splits into three kernels:
| Kernel | Role |
|---|---|
| Scatter (×2) | Distributes rows of matrices A and B into per-PE stream arrays |
| ProcElem (×p²) | Each PE computes its sub-matrix tile and shifts blocks to neighbours |
| Gather (×1) | Collects results from all PEs into the output matrix |
Compile (three invocations from one source file):
tapa compile --top Scatter -f cannon.cpp -o scatter.xo --part-num xcu280-fsvh2892-2L-e --clock-period 3.33
tapa compile --top ProcElem -f cannon.cpp -o proc-elem.xo --part-num xcu280-fsvh2892-2L-e --clock-period 3.33
tapa compile --top Gather -f cannon.cpp -o gather.xo --part-num xcu280-fsvh2892-2L-e --clock-period 3.33
Run:
./cannon-host \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo \
-cosim_work_dir ./cosim_work \
-cosim_work_dir_parallel
A successful run prints PASS! after all simulation processes finish.
See also: Fast Hardware Simulation — single-kernel cosim with the same -cosim_* and -xsim_* flags.
Vitis Cosimulation
Purpose: Run full Vitis hardware emulation for more representative timing after fast cosim passes.
When to use this: When you need timing or bandwidth estimates that fast cosim cannot provide. This step is slow (5–10 minutes for simple designs) and is rarely the first choice; run Fast Hardware Simulation first to catch logic errors.
What you need
- A .xo kernel object from tapa compile
- Vitis and XRT installed (Linux only)
- The target platform string (e.g., xilinx_u280_xdma_201920_3)
Commands
Generate the hardware emulation bitstream
platform=xilinx_u280_xdma_201920_3
v++ -o vadd.$platform.hw_emu.xclbin \
--link \
--target hw_emu \
--kernel VecAdd \
--platform $platform \
vadd.$platform.hw.xo
Replace $platform with your actual target platform string and VecAdd with your top-level kernel name. This step typically takes 5–10 minutes.
Run the hardware emulation
./vadd --bitstream=vadd.$platform.hw_emu.xclbin 1000
The same host executable used for software simulation and fast cosim runs unchanged here — only the --bitstream argument changes.
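The way one executable covers every stage can be pictured as a dispatch on the --bitstream value. The sketch below is an assumption for illustration: `RunMode`, `EndsWith`, and `ClassifyBitstream` are hypothetical, and deciding by file extension is a guess at the mechanism, which this guide does not specify. Only the mapping itself (empty, .xo or .zip, .xclbin) comes from the sections above.

```cpp
#include <string>

enum class RunMode { kSoftwareSim, kFastCosim, kXclbin, kUnknown };

// True when `s` ends with `suffix`.
bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical classifier mirroring the documented --bitstream behavior.
RunMode ClassifyBitstream(const std::string& path) {
  if (path.empty()) return RunMode::kSoftwareSim;  // no --bitstream given
  if (EndsWith(path, ".xo") || EndsWith(path, ".zip")) {
    return RunMode::kFastCosim;                    // fast hardware simulation
  }
  if (EndsWith(path, ".xclbin")) return RunMode::kXclbin;  // hw_emu or hw
  return RunMode::kUnknown;
}
```

Under this reading, whether an .xclbin runs as hardware emulation or on a board depends on the target it was linked for, not on anything in the host code.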
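Expected output
The sample log below shows the xilinx_u250_xdma_201830_2 platform; your log will name the platform you linked against.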
INFO: Loading vadd.xilinx_u250_xdma_201830_2.hw_emu.xclbin
INFO: Found platform: Xilinx
INFO: Found device: xilinx_u250_xdma_201830_2
INFO: Using xilinx_u250_xdma_201830_2
INFO: [HW-EMU 01] Hardware emulation runs simulation underneath. Using a large data set will result in long simulation times. It is recommended that a small dataset is used for faster execution. The flow uses approximate models for DDR memory and interconnect and hence the performance data generated is approximate.
...
INFO: [HW-EMU 06-0] Waiting for the simulator process to exit
INFO: [HW-EMU 06-1] All the simulator processes exited successfully
elapsed time: 31.0901 s
PASS!
Vitis hardware emulation uses approximate models for DDR memory and interconnects. Performance numbers from hw_emu are indicative, not exact. For precise measurements, run on an actual board using an hw bitstream.
Validation
The run is correct when:
- The INFO: [HW-EMU 06-1] All the simulator processes exited successfully line appears.
- The application's correctness check prints PASS!.
- The elapsed time is reported (confirming the kernel actually executed).
Use a small dataset for hardware emulation runs. Large datasets cause proportionally long simulation times because every clock cycle is simulated in software.
If something goes wrong
See Cosimulation Issues for diagnosis steps. Common issues include missing XRT environment variables, platform string mismatches, and kernel name mismatches between the --kernel flag and the TAPA top-level function name.
Next step: Build & Run on Board
Build & Run on Board
Purpose: Build a TAPA design into an FPGA bitstream and run it on an Alveo board.
When to use this: After fast cosim (and optionally Vitis cosim) passes — this step converts your .xo kernel object into a hardware bitstream and executes it on real silicon.
What you need
- A .xo kernel object from tapa compile
- Vitis and XRT installed (Linux only)
- The target platform string (e.g., xilinx_u280_xdma_201920_3)
- An Alveo board installed in the system for the final execution step
- Several hours of compute time for v++ --link
Stage 1: Compile the kernel with TAPA
If you do not already have a .xo, produce it with tapa compile:
platform=xilinx_u280_xdma_201920_3
tapa \
--work-dir work.out \
compile \
--top VecAdd \
--part-num xcu280-fsvh2892-2L-e \
--clock-period 3.33 \
-f vadd.cpp \
-o vadd.$platform.hw.xo
The .xo file is the artifact that feeds v++.
Stage 2: Link into an FPGA bitstream
v++ -o vadd.$platform.hw.xclbin \
--link \
--target hw \
--kernel VecAdd \
--platform $platform \
vadd.$platform.hw.xo
This step takes several hours depending on design complexity and host machine performance. Plan accordingly and consider running it on a dedicated build server (see Remote Execution).
The output artifact is vadd.$platform.hw.xclbin — this is the bitstream loaded onto the FPGA.
Key alignment rules:
- --kernel VecAdd must match the top-level function name in your TAPA source.
- --platform $platform must target the same device as the part number passed to tapa compile --part-num (here, xcu280-fsvh2892-2L-e for the U280 platform).
- The input .xo filename (vadd.$platform.hw.xo) must be the file produced by tapa compile.
Stage 3: Execute on the FPGA
The same host executable used for software and hardware simulation runs on board:
./vadd --bitstream=vadd.$platform.hw.xclbin
Expected output
INFO: Found platform: Xilinx
INFO: Found device: xilinx_u280_xdma_201920_3
INFO: Using xilinx_u280_xdma_201920_3
...
elapsed time: 7.48926 s
PASS!
On-board execution is substantially faster than hardware emulation. The elapsed time includes FPGA reconfiguration time (loading the bitstream).
Validation
The run is correct when:
- XRT finds and selects the expected device.
- The elapsed time is reported.
- The application's correctness check prints PASS!.
If you use std::vector for memory-mapped buffers, XRT may warn about unaligned host pointers, which causes an extra memory copy. To eliminate the copy, use std::vector<T, tapa::aligned_allocator<T>> instead.
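To show what an aligned host buffer involves, here is a minimal allocator sketch in C++17. It is not the source of tapa::aligned_allocator: `PageAlignedAllocator` and `IsAligned` are hypothetical names, and the 4096-byte alignment is an assumption based on the page alignment XRT commonly expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical page-aligned allocator built on C++17 aligned operator new.
template <typename T, std::size_t Alignment = 4096>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U, Alignment>&) noexcept {}

  // Explicit rebind is needed because of the non-type template parameter.
  template <typename U>
  struct rebind {
    using other = PageAlignedAllocator<U, Alignment>;
  };

  T* allocate(std::size_t n) {
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

template <typename T, typename U, std::size_t A>
bool operator==(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) noexcept {
  return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) noexcept {
  return false;
}

// True when p sits on an `alignment`-byte boundary.
bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A buffer declared as `std::vector<float, PageAlignedAllocator<float>>` then starts on a 4096-byte boundary, which `IsAligned(buf.data(), 4096)` can confirm. In a TAPA host program, prefer the library's own tapa::aligned_allocator.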
If something goes wrong
See Common Errors for diagnosis steps. Common issues include XRT not finding the device, platform string mismatches, and bitstream generated for a different platform than the installed board.
Next step: Remote Execution
Remote Execution
Purpose: Offload TAPA vendor-tool steps to a remote Linux machine over SSH.
When to use this: When your development machine is macOS (where Xilinx/AMD tools are unavailable) or when you want to delegate long-running HLS synthesis and implementation steps to a dedicated Linux build server.
What you need
- SSH access to a Linux machine with Vitis HLS and/or Vivado installed
- The path to settings64.sh on the remote machine
- TAPA installed locally (the tapa analyze step always runs locally)
How remote execution works
TAPA splits work between local and remote:
| Step | Runs where |
|---|---|
| tapa analyze (runs tapa-cpp and tapacc) | Always local |
| tapa synth (Vitis HLS synthesis) | Remote when --remote-host is set |
| tapa pack (IP packaging) | Remote when --remote-host is set |
| Host fast-cosim runtime (--bitstream=*.xo) | Remote when --remote-host is set |
| File transfer (.xo, .zip artifacts) | Handled automatically by TAPA |
Commands
Inline remote flags
tapa \
--work-dir work.out \
--remote-host alice@build-server.example.com:22 \
--remote-key-file ~/.ssh/id_ed25519 \
--remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
compile \
--top VecAdd \
--part-num xcu280-fsvh2892-2L-e \
--clock-period 3.33 \
-f vadd.cpp \
-o vadd.xo
Parallel HLS jobs on the remote host
Use -j to run up to N Vitis HLS processes in parallel on the remote machine:
tapa \
--work-dir work.out \
--remote-host alice@build-server.example.com \
--remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
synth \
-j 8 \
...
TAPA_CONCURRENCY and -j are different controls:
- TAPA_CONCURRENCY controls the number of parallel software-simulation threads used by the host runtime during functional simulation (tapa::invoke with no bitstream). It has no effect on HLS or remote execution.
- -j (passed to tapa synth) controls how many Vitis HLS processes run in parallel on the remote host.
Keep -j at or below the number of cores available on the remote machine.
Reusing the SSH connection
To avoid establishing a new TCP connection on every tapa invocation, use connection multiplexing with a persistent socket directory:
tapa \
--work-dir work.out \
--remote-host alice@build-server.example.com \
--remote-ssh-control-dir ~/.ssh/tapa-mux \
--remote-ssh-control-persist 4h \
--remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
compile \
...
The master connection stays alive for 4 hours after the last client closes. Subsequent tapa invocations within that window reuse the existing TCP connection.
Remote flags reference
| Flag | Description |
|---|---|
| --remote-host user@host[:port] | Remote Linux host for vendor tools. Omit user to use the current local username; omit port to use 22. |
| --remote-key-file PATH | SSH private key for authentication. Defaults to the SSH agent or ~/.ssh/id_rsa. |
| --remote-xilinx-settings PATH | Path to settings64.sh on the remote host. TAPA sources this before invoking Vitis HLS. |
| --remote-ssh-control-dir DIR | Local directory for OpenSSH multiplex control sockets. Share across invocations to reuse the master connection. |
| --remote-ssh-control-persist DURATION | How long the master socket stays alive after the last connection closes (e.g., 30m, 4h). Default: 30m. |
| --remote-disable-ssh-mux | Disable SSH connection multiplexing. Each SSH/SCP call opens a fresh connection. Use this when the remote host or a proxy does not support ControlMaster. |
Persistent configuration via ~/.taparc
Instead of repeating remote flags on every invocation, store them in ~/.taparc:
remote:
host: build-server.example.com
user: alice
port: 22
key_file: ~/.ssh/id_ed25519
xilinx_settings: /opt/Xilinx/Vitis/2024.1/settings64.sh
work_dir: /tmp/tapa-remote
ssh_control_dir: ~/.ssh/tapa-mux
ssh_control_persist: 4h
ssh_multiplex: true
CLI flags always override the corresponding ~/.taparc values. In particular, --remote-host replaces the host, user, and port fields from the config file.
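The user@host[:port] form and its defaults can be sketched as a small parser. `RemoteHost` and `ParseRemoteHost` are hypothetical names written for this illustration, not TAPA's code; the sketch follows the documented defaults (missing user falls back to the local username, missing port to 22) and does not handle bracketed IPv6 addresses or malformed port numbers.

```cpp
#include <string>

struct RemoteHost {
  std::string user;
  std::string host;
  int port;
};

// Hypothetical parser for "user@host[:port]".
RemoteHost ParseRemoteHost(const std::string& spec,
                           const std::string& local_user) {
  RemoteHost r{local_user, spec, 22};  // documented defaults
  const auto at = spec.find('@');
  if (at != std::string::npos) {
    r.user = spec.substr(0, at);
    r.host = spec.substr(at + 1);
  }
  const auto colon = r.host.rfind(':');
  if (colon != std::string::npos) {
    r.port = std::stoi(r.host.substr(colon + 1));  // throws on bad input
    r.host = r.host.substr(0, colon);
  }
  return r;
}
```

Because a missing port resolves to 22 inside the parse step, a port stored in ~/.taparc is never consulted once --remote-host is given on the command line.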
Validation
After a successful remote compile, the .xo artifact is automatically transferred back to your local machine. Check for it:
ls -lh vadd.xo
TAPA prints transfer progress and the remote Vitis HLS log to standard output during the run.
If something goes wrong
SSH connection refused or timeout: Verify the host, port, and that your key is accepted with ssh -i ~/.ssh/id_ed25519 alice@build-server.example.com.
settings64.sh not found: Confirm the path is correct on the remote machine with ssh alice@build-server.example.com ls /opt/Xilinx/Vitis/2024.1/settings64.sh.
ControlMaster errors: If the remote host or an intermediary proxy does not support SSH multiplexing, add --remote-disable-ssh-mux to your invocation.
Port conflicts with ~/.taparc: If you omit the port in --remote-host, TAPA defaults to port 22 — it does not fall back to the port field from ~/.taparc. Always include the port explicitly (e.g., user@host:2222) when the remote host listens on a non-standard port.
Next step: Using the Visualizer
Using the Visualizer
Purpose: Inspect your TAPA design's task graph and dataflow using the visualizer.
When to use this: When you want to understand the task hierarchy and stream connections in your design, trace data flows between tasks, or navigate complex hierarchical designs.
What you need
- A graph.json file generated by tapa compile (found in the work directory under work.out/)
- A modern web browser (Chrome, Edge, Firefox, or other Chromium/Firefox-based browser)
- The TAPA Visualizer web app, built from the tapa-visualizer/ directory in the TAPA repository
Commands
- Run tapa compile with a --work-dir to produce graph.json:
  tapa --work-dir work.out compile --top VecAdd ...
- Open the TAPA Visualizer in your browser.
- Click the Choose File input in the top-left corner and select work.out/graph.json.
The graph loads and renders automatically after file selection.

Interface components
Top toolbar
The toolbar provides controls for working with the graph:
File controls:
- Choose File — select a graph.json file to load.
- Clear Graph — remove the current graph from the view.
Sub-task display modes — three modes control how task instances are shown:
| Mode | Description |
|---|---|
| Merge Sub-task | One node per task type; all instances merged into a single node. |
| Separate Sub-task | One node per instance, named taskname/0, taskname/1, with connections named connection/0, connection/1, etc. |
| Expand Sub-task | One node per actual sub-task instance, each with its own sibling tree rather than being merged. |

The image above shows (left to right) Merge, Separate, and Expand modes. Notice the Load combo in the top-left: Mmap2Stream has 2 sub-tasks, which appear differently in each mode.
Action buttons:
- Rerender Graph — re-lays out the graph and fits it to the view. Useful for large graphs or when using progressive layout algorithms like ForceAtlas2.
- Fit Center — centers the graph in the view.
- Fit View — centers and resizes the graph to fit the current viewport.
- Save Image — exports the current graph as an image file.
- Toggle Sidebar — shows or hides the information sidebar.
Interactive graph
The graph represents your TAPA design as a hierarchical, directed graph:
- Nodes represent tasks. Color indicates connectivity: nodes with only incoming or outgoing connections appear in lighter colors; nodes with both appear darker.
- Edges represent connections (typically FIFO streams) between tasks.
- Combos (rectangular container areas) represent upper-level tasks containing nested tasks.
Supported interactions:
| Interaction | Effect |
|---|---|
| Click an element | Displays its details in the sidebar. |
| Drag a node | Repositions the node. |
| Double-click a combo | Expands or collapses its contents. |
| Drag the background | Pans the view. |
| Shift+drag | Box selection. |
| Ctrl+drag | Lasso selection. |

Sidebar

The sidebar provides detailed information through several tabs:
| Tab | Contents |
|---|---|
| Explorer | Hierarchical list of all tasks and sub-tasks; use it to quickly navigate complex designs. |
| Cflags | The compiler flags passed when building the graph. |
| Details | Comprehensive information about the currently selected element: task properties, parameters, and connectivity. |
| Connections | All connections and neighboring tasks for the selected element; useful for tracing data flows. |
| Options | Additional visualization settings: layout algorithm, task expansion options, and connection port visibility. |
Validation
The visualizer is working correctly when:
- The graph renders with nodes and edges visible after loading graph.json.
- Clicking a node or edge populates the Details tab in the sidebar.
- Double-clicking a combo expands or collapses its contents.
Browser compatibility
| Category | Browsers |
|---|---|
| Fully supported | Chrome, Edge, and other Chromium-based browsers; Firefox and Firefox-based browsers |
| Partially supported | Safari and other WebKit-based browsers (should work but not extensively tested) |
| Unsupported | Internet Explorer and browsers not updated within the past 12 months |
Using a modern, up-to-date browser is essential for both TAPA Visualizer compatibility and general web security.
If something goes wrong
If the graph fails to load or renders blank, check that graph.json was produced by tapa compile and is not empty. See Common Errors for further diagnosis.
Next step: Performance Tuning
Performance Tuning
Purpose: Identify and fix throughput bottlenecks in your TAPA design.
When to use this: When your design builds and runs correctly but measured throughput is below your target — for example, the kernel time is higher than expected or resource utilization is unexpectedly high.
What you need
- A compiled .xo from tapa compile --work-dir work.out
- Reports in work.out/ (synthesis reports, utilization data)
- Understanding of your design's expected throughput
Prioritized checklist
Work through these checks in order — each is faster to fix than the next.
1. Check initiation interval (II) in synthesis reports
After tapa compile, check the HLS reports in work.out/ for II violations:
- An II > 1 on a pipelined loop means the loop is not fully pipelined and throughput is reduced.
- Look for WARNING: [HLS ...] Unable to schedule or II = N where N > 1 in the HLS log.
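Fix: Add #pragma HLS pipeline II=1 if the loop has no pipeline pragma, or restructure the loop body to remove the loop-carried dependency (for example, a value written in one iteration and read in the next) that prevents the requested II.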
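To see why an II above 1 matters, a common rough estimate for a pipelined loop is depth + II × (N − 1) cycles. The sketch below is plain C++ with a hypothetical helper name; it is not a TAPA or Vitis HLS API, and the pipeline depth of 10 used in the note afterwards is an assumed figure for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Rough latency model for a pipelined loop (hypothetical helper, not a TAPA API).
// The first iteration needs `depth` cycles to pass through the pipeline; every
// later iteration starts `ii` cycles after the previous one.
inline uint64_t PipelinedLoopCycles(uint64_t trip_count, uint64_t ii,
                                    uint64_t depth) {
  assert(ii >= 1);  // an initiation interval below 1 is not meaningful here
  return trip_count == 0 ? 0 : depth + ii * (trip_count - 1);
}
```

With these assumed numbers, a 1024-iteration loop takes 1033 cycles at II=1 and 2056 cycles at II=2, so doubling the II roughly halves throughput.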
2. Check memory throughput — consider async_mmap
Synchronous mmap accesses stall the task until each memory transaction completes. If your task spends time waiting for DRAM:
- Use tapa::async_mmap to overlap computation and memory access.
- Check the synthesis report for memory interface utilization.
3. Check stream depths — FIFOs too shallow?
FIFOs that are too shallow cause backpressure and reduce throughput when producer and consumer tasks run at different rates. If tasks are frequently stalling:
- Increase the stream depth in your TAPA source: tapa::stream<T, DEPTH>.
- Check waveforms from fast cosim (-xsim_save_waveform) to observe backpressure.
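The effect of a shallow FIFO can be illustrated with a toy cycle-level model. This is plain C++, not TAPA code. The traffic pattern is an assumption chosen for illustration: the producer may write during the first 4 cycles of every 8-cycle period and the consumer may read during the last 4, so both sides have the same average rate but never overlap.

```cpp
#include <cassert>
#include <cstdint>

struct FifoSimResult {
  uint64_t cycles;           // cycles until the consumer has read n elements
  uint64_t producer_stalls;  // cycles the producer wanted to write but could not
};

// Toy model (hypothetical, for illustration only): moves n elements through a
// FIFO of the given depth under the assumed 4-on / 4-off traffic pattern.
inline FifoSimResult SimulateFifo(uint64_t n, uint64_t depth) {
  assert(depth > 0);
  constexpr uint64_t kPeriod = 8;
  constexpr uint64_t kActive = 4;
  uint64_t occupancy = 0, produced = 0, consumed = 0, stalls = 0, cycle = 0;
  while (consumed < n) {
    const bool producer_turn = (cycle % kPeriod) < kActive;
    if (producer_turn) {
      if (produced < n) {
        if (occupancy < depth) {
          ++occupancy;  // write accepted
          ++produced;
        } else {
          ++stalls;  // FIFO full: backpressure on the producer
        }
      }
    } else if (occupancy > 0) {
      --occupancy;  // read accepted
      ++consumed;
    }
    ++cycle;
  }
  return {cycle, stalls};
}
```

In this model, moving 64 elements takes 128 cycles with no stalls at depth 4, but 254 cycles with 62 producer stalls at depth 2. Real designs have less regular traffic, so treat the model as intuition and use cosim waveforms for actual sizing.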
4. Find resource hotspots with --enable-synth-util
Run synthesis with utilization reporting enabled:
tapa --work-dir work.out synth \
  --enable-synth-util \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33
TAPA runs an additional RTL synthesis pass and writes per-task resource counts to:
- work.out/report.json — machine-readable JSON
- work.out/report.yaml — human-readable YAML
Both files contain per-task LUT, FF, BRAM, and DSP counts. Use them to identify which tasks are consuming the most resources before proceeding to full implementation.
Validation
After running tapa synth --enable-synth-util, confirm the reports were written:
ls work.out/report.json work.out/report.yaml
- work.out/report.json — machine-readable per-task resource counts (LUT, FF, BRAM, DSP)
- work.out/report.yaml — human-readable version of the same data
If these files are missing, synthesis either did not run or exited before the reporting step. Check the HLS log in work.out/ for errors.
Advanced synthesis flags
Controlling FIFO pipelining for floorplanning
By default, TAPA inserts pipeline registers into stream FIFOs to improve timing. When grouping FIFOs with their adjacent logic inside a single floorplan region, suppress pipelining for specific FIFOs:
tapa synth --nonpipeline-fifos fifos.json ...
fifos.json lists the FIFO names to suppress:
["fifo_a", "fifo_b"]
After synthesis, TAPA writes grouping_constraints.json to the work directory. Pass this file to RapidStream or other floorplanning tools.
AutoBridge graph generation
Generate an ab_graph.json for AutoBridge/RapidStream partition-based floorplanning:
tapa synth \
  --gen-ab-graph \
  --floorplan-config floorplan.json \
  ...
--floorplan-config is required when --gen-ab-graph is used. It specifies the target device floorplan regions.
GraphIR generation
Produce a GraphIR representation for RapidStream:
tapa synth \
  --gen-graphir \
  --device-config device.json \
  --floorplan-path floorplan.json \
  ...
Both --device-config and --floorplan-path are required:
| Flag | Description |
|---|---|
| --device-config PATH | JSON file describing the physical device (SLR layout, DSP column positions, etc.) |
| --floorplan-path PATH | Floorplan assignment file applied to the program before GraphIR is emitted |
The output is work.out/graphir.json, suitable for consumption by RapidStream.
Advanced flags summary
| Flag | Description |
|---|---|
| --enable-synth-util | Run post-HLS RTL synthesis to collect per-task resource utilization. |
| --disable-synth-util | Do not run post-HLS RTL synthesis (default). |
| --nonpipeline-fifos <json> | Suppress pipeline registers for listed FIFOs; write grouping_constraints.json. |
| --gen-ab-graph | Generate ab_graph.json for AutoBridge/RapidStream floorplanning. Requires --floorplan-config. |
| --floorplan-config PATH | Device floorplan region description. Required with --gen-ab-graph. |
| --gen-graphir | Generate graphir.json for RapidStream. Requires --device-config and --floorplan-path. |
| --device-config PATH | Physical device description for GraphIR conversion. Required with --gen-graphir. |
| --floorplan-path PATH | Floorplan assignment applied before GraphIR emission. Required with --gen-graphir. |
If something goes wrong
See Common Errors for help with synthesis failures, II violation messages, and resource overflows.
Next step: Learning Path
Learning Path
These labs walk through the TAPA programming model from first principles to advanced topics. Each lab builds on the previous one — you will understand each concept more deeply if you complete them in order. Allow roughly four hours to work through all six labs.
Labs
| Lab | Topic | Prerequisites | Time | Skip if... |
|---|---|---|---|---|
| Lab 1: Vector Add | Core programming model | Your First Run | 20 min | You already understand task graphs and mmap |
| Lab 2: High-Bandwidth Memory | async_mmap for memory throughput | Lab 1 | 30 min | You only need basic mmap |
| Lab 3: Migrating from Vitis HLS | Porting existing HLS code | Lab 1 | 30 min | You are new to FPGA HLS |
| Lab 4: Custom RTL Modules | Integrating hand-written RTL | Lab 1 | 45 min | You don't need to integrate RTL |
| Lab 5: Parallel RTL Emulation | Multi-kernel concurrent cosimulation | Lab 1, Fast Hardware Simulation | 30 min | Your design is a single kernel |
| Lab 6: Floorplan & DSE | Floorplanning for multi-SLR FPGAs | Lab 2 | 60 min | You are not targeting multi-SLR devices |
Where to start
New to FPGA HLS — Start at Lab 1. It introduces the task graph model that every later lab assumes you understand.
Coming from Vitis HLS — Lab 3 covers the mechanical differences, but reading Lab 1 first is worthwhile because TAPA's concurrency model is structurally different from standard HLS. If you have already read the Programming Model page, you can go directly to Lab 3.
Already ran vadd in First Run — You have seen the commands; Lab 1 does the deep-dive explanation of why the code is structured the way it is. It is worth reading even if the output was correct.
Need HBM throughput — Work through Lab 2 (async_mmap) and then Lab 6 (floorplanning). Both are required to get full memory bandwidth on multi-SLR devices.
Building a multi-kernel pipeline — Lab 5 covers parallel RTL emulation, which lets you validate inter-kernel dataflow at RTL level before the bitstream link step.
Background reading
Before starting any lab, the Programming Model page covers the vocabulary used throughout: task graphs, streams, mmap, and the compile pipeline. The labs assume you have read at least the Programming Model page.
Start here: Lab 1: Vector Add
Lab 1: Vector Add
Goal: Understand why the VecAdd design is structured as four concurrent tasks connected by streams, and what each structural choice means for hardware generation.
Prerequisites: Complete Your First Run so that you have already built and run the vadd example. This lab explains what you ran — it does not repeat the run commands.
After this lab you will understand:
- How a top-level task orchestrates leaf tasks without containing computation
- How mmap and stream arguments express data movement
- How the host invocation connects host memory to the hardware kernel
Design overview
VecAdd computes c[i] = a[i] + b[i] for n elements. The implementation is a four-task pipeline:
Mmap2Stream(a) ──► a_q ──►
                           Add ──► c_q ──► Stream2Mmap(c)
Mmap2Stream(b) ──► b_q ──►
This is a producer-pipeline-consumer pattern. The two Mmap2Stream tasks read from global memory and feed elements into streams. Add consumes both streams and produces a result stream. Stream2Mmap drains the result stream back to global memory. All four tasks run concurrently once VecAdd is invoked — there is no sequencing between them.
The reason for this decomposition is not code style. TAPA generates separate hardware modules for each task, and the streams between them become FIFOs on the FPGA. When each stage is continuously supplied with data, the pipeline can run at full throughput.
Mmap2Stream
void Mmap2Stream(tapa::mmap<const float> mmap, uint64_t n,
                 tapa::ostream<float>& stream) {
  for (uint64_t i = 0; i < n; ++i) {
    stream << mmap[i];
  }
}
tapa::mmap<const float> is passed by value, not by reference. This is a hard rule in TAPA: mmap arguments to leaf tasks must be passed by value. The const qualifier marks the memory as read-only, which causes the compiler to generate a read-only AXI master port during synthesis. See mmap for details.
Inside the loop, mmap[i] is array-style access to global memory. Each access becomes an AXI read transaction. The << operator writes the element to the output stream, blocking if the FIFO is full. HLS can pipeline this loop at II=1 when the memory access latency is hidden by the pipeline depth.
Add
void Add(tapa::istream<float>& a, tapa::istream<float>& b,
         tapa::ostream<float>& c, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    c << (a.read() + b.read());
  }
}
Stream arguments are passed by reference. This is the mirror of the mmap rule: streams must be by reference, mmap must be by value. See Tasks for a full explanation.
a.read() blocks until an element is available in the FIFO. This is safe here because the loop runs exactly n times, and Mmap2Stream feeds exactly n elements into each stream. There is no risk of deadlock as long as the element counts match.
The << on the output stream blocks if the downstream FIFO (c_q) is full. That backpressure propagates through the pipeline: Add stalls, which causes a_q and b_q to fill, which eventually stalls both Mmap2Stream tasks. The pipeline self-regulates without any explicit flow control logic.
HLS can pipeline this loop at II=1 because the operations (two reads and one add) are independent across iterations.
Stream2Mmap
void Stream2Mmap(tapa::istream<float>& stream, tapa::mmap<float> mmap,
                 uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    stream >> mmap[i];
  }
}
This is the mirror of Mmap2Stream. The >> operator reads one element from the stream (blocking) and writes it to global memory. The mmap is non-const this time because the output buffer is writable.
The same structural rules apply: mmap by value (non-const for write access), stream by reference.
VecAdd — the top-level task
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a");
  tapa::stream<float> b_q("b");
  tapa::stream<float> c_q("c");
  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}
VecAdd contains no computation — no arithmetic, no memory access, no loops. This is deliberate. Upper-level tasks in TAPA are orchestration-only: they declare streams, then launch child tasks. Putting computation in an upper-level task is not supported.
The tapa::stream<float> declarations create named FIFOs. The string names ("a", "b", "c") are used by TAPA's debug infrastructure: setting TAPA_STREAM_LOG_DIR causes TAPA to log every element transferred through each named stream, which is useful when tracking down data corruption.
The .invoke() chain starts all four child tasks simultaneously. TAPA does not sequence them — there is no "run Mmap2Stream first, then Add". All four tasks are live from the moment VecAdd is invoked, and they communicate entirely through the stream FIFOs. The task graph is what determines data ordering, not the order of .invoke() calls.
For a full description of the task graph model, see The Programming Model.
The .invoke() chain is syntactic sugar for constructing a tapa::task object and calling .invoke() on it repeatedly. Each call returns the same task object, which is why chaining works. The task object goes out of scope at the end of VecAdd, which causes TAPA to wait for all child tasks to finish before returning.
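The chaining and wait-on-scope-exit behavior described above can be mimicked in a few lines of standard C++. The class below is a toy stand-in built on std::thread. It is not tapa::task, and its name and internals are assumptions made purely for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in (NOT tapa::task): each invoke() starts a child immediately and
// returns the same object, and the destructor waits for every child.
class ToyTask {
 public:
  template <typename Fn, typename... Args>
  ToyTask& invoke(Fn&& fn, Args&&... args) {
    threads_.emplace_back(std::forward<Fn>(fn), std::forward<Args>(args)...);
    return *this;  // returning the same object is what makes chaining work
  }

  ~ToyTask() {
    for (auto& t : threads_) t.join();  // block until all children finish
  }

 private:
  std::vector<std::thread> threads_;
};

// Launches three children through one chain and reports how many completed
// by the time the chain's temporary object has been destroyed.
inline int RunThreeChildren() {
  std::atomic<int> finished{0};
  auto child = [&finished] { ++finished; };
  ToyTask().invoke(child).invoke(child).invoke(child);
  // The temporary ToyTask is destroyed at the end of the statement above,
  // so all three children have been joined before this line runs.
  return finished.load();
}
```

RunThreeChildren() returns 3 because the destructor joins every child before the count is read. The sketch says nothing about how TAPA schedules tasks internally.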
Host code
int64_t kernel_time_ns = tapa::invoke(
    VecAdd, FLAGS_bitstream,
    tapa::read_only_mmap<const float>(a),
    tapa::read_only_mmap<const float>(b),
    tapa::write_only_mmap<float>(c), n);
tapa::invoke is the host-side entry point. It is not the same as calling VecAdd() directly: calling VecAdd() would run it as a plain C++ function (software simulation without timing), while tapa::invoke selects the execution mode based on the bitstream path:
- Empty string ("") — software simulation. TAPA runs VecAdd as C++ but with stream and mmap semantics enforced by the runtime library. Fast, no FPGA required.
- .xo file — fast cosimulation. The synthesized RTL runs inside a cycle-accurate simulator. Useful for verifying timing-sensitive behavior.
- .xclbin file — hardware execution on a real FPGA.
tapa::read_only_mmap<const float>(a) wraps the host vector a and tells the runtime to transfer it to the FPGA as a read-only buffer. tapa::write_only_mmap<float>(c) marks c as write-only, so the runtime transfers results back after the kernel finishes. These are directives to the runtime about transfer direction — they do not add C++ access restrictions beyond what the type already expresses.
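The three execution modes selected by the bitstream path (empty string, .xo, .xclbin) can be summarized as a small classifier. This is a hypothetical helper written for illustration: the real dispatch happens inside the TAPA runtime and may differ in detail, and the enum and function names are assumptions.

```cpp
#include <cassert>
#include <string>

enum class ExecMode { kSoftwareSim, kFastCosim, kHardware, kUnknown };

// Returns true when `s` ends with `suffix`.
inline bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical classifier mirroring the documented rule; not part of TAPA.
inline ExecMode ModeForBitstream(const std::string& path) {
  if (path.empty()) return ExecMode::kSoftwareSim;            // ""
  if (EndsWith(path, ".xclbin")) return ExecMode::kHardware;  // real FPGA
  if (EndsWith(path, ".xo")) return ExecMode::kFastCosim;     // RTL simulation
  return ExecMode::kUnknown;  // anything else is outside the documented cases
}
```

Because the mode depends only on the path, the same host binary can be reused for software simulation, cosimulation, and hardware runs by changing the bitstream argument.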
For the actual build and run commands, see Your First Run.
Rules summary
- Leaf task arguments: streams by reference (tapa::istream<T>&, tapa::ostream<T>&), mmap by value (tapa::mmap<T>)
- Upper-level tasks: declare streams with tapa::stream<T>, invoke child tasks with .invoke(), contain no computation
- Stream names (the string argument to tapa::stream<T>) are used by the debug infrastructure and appear in error messages — always name your streams
- mmap const-ness (const float vs float) determines whether the synthesized AXI master port is read-only or read-write; transfer direction at runtime is set separately by read_only_mmap / write_only_mmap on the host side
If you see a compilation error about streams being passed by value or mmap being passed by reference, check your task signatures. TAPA enforces these argument-passing conventions at compile time.
Next step: Lab 2: High-Bandwidth Memory
Lab 2: High-Bandwidth Memory with async_mmap
Goal: Achieve high DRAM throughput by overlapping multiple outstanding memory requests using async_mmap.
Prerequisites: Lab 1: Vector Add and Memory Access: async_mmap
After this lab you will understand:
- Why sequential memory access wastes most of the available DRAM bandwidth
- How the two-counter loop pattern keeps multiple requests in flight simultaneously
- How to correctly coordinate the three write channels and drain write_resp
The problem: one request at a time
With a plain mmap<T> argument, each read or write is a blocking operation. The loop below looks innocuous, but every iteration stalls waiting for data to return from DRAM before the next address is issued:
// Problematic: one outstanding request at a time
for (int i = 0; i < n; i++) {
  result[i] = mem[i];  // blocks until data returns
}
Off-chip DRAM latency is typically 100–200 ns. At a 300 MHz clock that is 30–60 idle cycles per element. For sequential access patterns the HLS tool's burst inference may help, but for random-access patterns or when you need explicit control over request depth, mmap leaves most of the available bandwidth unused.
async_mmap solves this by exposing the five AXI channels directly as streams. You can issue many read addresses before any data returns, keeping dozens of requests in flight and hiding the per-request latency behind the steady flow of data. See Memory Access: async_mmap for the channel layout and area comparison.
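The latency figures above, and the benefit of keeping requests in flight, can be checked with a little arithmetic. The helpers below are hypothetical and use integer math. The throughput bound follows Little's law and assumes the interface could otherwise deliver one element per cycle, which is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Cycles spent on one DRAM round trip: latency_ns * clock_mhz / 1000.
inline uint64_t LatencyCycles(uint64_t latency_ns, uint64_t clock_mhz) {
  return latency_ns * clock_mhz / 1000;
}

// Little's law bound, as a percentage of the assumed one-element-per-cycle
// peak: with `in_flight` outstanding requests and a round trip of
// `latency_cycles`, at most in_flight / latency_cycles elements complete
// per cycle.
inline uint64_t PeakPercent(uint64_t in_flight, uint64_t latency_cycles) {
  assert(latency_cycles > 0);
  return std::min<uint64_t>(100, 100 * in_flight / latency_cycles);
}
```

At 300 MHz, 100 ns and 200 ns correspond to 30 and 60 cycles. With one outstanding request and a 30-cycle round trip, the bound is about 3% of peak. With 64 outstanding requests and a 60-cycle round trip, latency is no longer the limit.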
Example 1: Overlapping reads with a single loop
The idiomatic TAPA read pattern uses two counters in a single pipelined loop:
void ReadKernel(tapa::async_mmap<float>& mem, float* result, uint64_t n) {
  for (int64_t i_req = 0, i_resp = 0; i_resp < (int64_t)n;) {
#pragma HLS pipeline II=1
    // Issue the next address if the address channel has room.
    if (i_req < (int64_t)n && mem.read_addr.try_write(i_req)) ++i_req;
    // Collect a response if one has arrived.
    float val;
    if (mem.read_data.try_read(val)) {
      result[i_resp] = val;
      ++i_resp;
    }
  }
}
How it works:
- i_req tracks how many addresses have been issued; i_resp tracks how many responses have been received.
- The loop condition is i_resp < n: it runs until every response is collected, not just until every address is sent.
- mem.read_addr.try_write(i_req) is non-blocking. If the address channel is full this cycle, it returns false and the address is retried on the next cycle. i_req only advances when the write succeeds.
- mem.read_data.try_read(val) is non-blocking. If no data has arrived yet, it returns false and the loop continues without blocking.
- Because both branches are independent and non-blocking, the loop can issue a new address and receive a response in the same clock cycle.
- The difference i_req - i_resp is the current number of in-flight requests. The hardware limits this to the channel depth; TAPA coalesces sequential addresses into AXI bursts automatically at runtime, so you never need to write explicit burst logic.
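The two-counter loop can be modeled in plain C++ to see how in-flight requests hide latency. This is a toy cycle-level model, not TAPA code. It assumes a fixed round-trip latency, in-order responses, at most one issue and one response per cycle, and a cap on outstanding requests that stands in for the channel depth.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Toy model of the two-counter read loop (hypothetical, for illustration).
// Returns the number of cycles until all n responses have been collected.
inline uint64_t TwoCounterCycles(uint64_t n, uint64_t latency,
                                 uint64_t max_in_flight) {
  assert(max_in_flight > 0);
  std::deque<uint64_t> ready_at;  // completion cycle of each in-flight request
  uint64_t cycle = 0;
  for (uint64_t i_req = 0, i_resp = 0; i_resp < n; ++cycle) {
    // Address side: like try_write, succeeds only while there is room.
    if (i_req < n && ready_at.size() < max_in_flight) {
      ready_at.push_back(cycle + latency);
      ++i_req;
    }
    // Data side: like try_read, succeeds only once the data has arrived.
    if (!ready_at.empty() && ready_at.front() <= cycle) {
      ready_at.pop_front();
      ++i_resp;
    }
  }
  return cycle;
}
```

Under these assumptions, 1000 elements with a 30-cycle round trip finish in 1030 cycles when up to 64 requests may be in flight, compared with 31000 cycles when only one request may be outstanding at a time.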
Example 2: Sequential writes with burst detection
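Writes involve three channels: write_addr, write_data, and write_resp. The pattern commits a write only when the input stream has data and both the address and data channels have room, and it drains write_resp independently in the same loop: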
void WriteKernel(tapa::async_mmap<float>& mem,
                 tapa::istream<float>& data, uint64_t n) {
  for (int64_t i_req = 0, i_resp = 0; i_resp < (int64_t)n;) {
#pragma HLS pipeline II=1
    // Commit only when the input has data and both channels have room.
    if (i_req < (int64_t)n && !data.empty() &&
        !mem.write_addr.full() && !mem.write_data.full()) {
      mem.write_addr.try_write(i_req);
      mem.write_data.try_write(data.read(nullptr));
      ++i_req;
    }
    // Always drain write_resp so the write path never backs up.
    uint8_t ack;
    if (mem.write_resp.try_read(ack)) {
      i_resp += unsigned(ack) + 1;  // ack encodes burst length - 1
    }
  }
}
Key points:
- Before issuing a write, all three preconditions must hold: the input stream must have data, and neither the address nor the data channel may be full. Checking them together prevents partial commits.
- write_resp must be consumed even if you do not use the count. The hardware stops accepting new write addresses once the write_resp FIFO fills up, causing deadlock if the kernel never drains it.
- The ack value encodes burst_length - 1. TAPA detects that you are issuing sequential addresses and merges them into AXI bursts at runtime. A single write_resp entry can therefore acknowledge many writes, which is why the loop uses i_resp += unsigned(ack) + 1 rather than i_resp += 1.
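The burst_length - 1 encoding can be made concrete with a small helper that totals the writes acknowledged by a sequence of write_resp entries. The function name is hypothetical and the code is plain C++ rather than kernel code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts the individual writes acknowledged by a sequence of write_resp
// entries, where each entry stores (burst length - 1).
inline uint64_t WritesAcknowledged(const std::vector<uint8_t>& acks) {
  uint64_t total = 0;
  for (uint8_t ack : acks) total += static_cast<uint64_t>(ack) + 1;
  return total;
}
```

For example, the entries 15, 15, and 3 together acknowledge 16 + 16 + 4 = 36 writes. Counting one write per entry would leave the loop waiting for responses that never arrive.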
Rules for using async_mmap
- Pass async_mmap<T> by reference (async_mmap<T>&). Passing by value is an error.
- Only use try_read / try_write inside pipelined loops. Blocking read / write stalls the pipeline and will cause deadlock when combined with other non-blocking channels.
- Always drain write_resp, even if you discard the burst-length value.
- An mmap<T> argument can be passed to an async_mmap<T>& parameter in a child task without changing the caller.
Never use blocking read/write on async_mmap channels inside a pipelined loop. Because the five AXI channels are decoupled, blocking on one channel prevents progress on the others and causes the kernel to hang.
For the full API reference and the area comparison table showing how async_mmap compares to the Vitis HLS m_axi interface, see Memory Access: async_mmap.
Next step: Lab 3: Migrating from Vitis HLS
Lab 3: Migrating from Vitis HLS
Goal: Port an existing Vitis HLS kernel to TAPA by replacing HLS-specific constructs with their TAPA equivalents.
Prerequisites: Lab 1: Vector Add and familiarity with the TAPA task model.
After this lab you will understand:
- The mechanical substitutions that cover most Vitis HLS kernels
- Why the dataflow-in-a-loop pattern must be restructured in TAPA
- How tapa::hls::stream supports incremental migration of large codebases
Quick reference: Vitis HLS → TAPA
| Vitis HLS | TAPA | Notes |
|---|---|---|
| #include <hls_stream.h> | #include <tapa.h> | TAPA includes its own stream types |
| T* port + #pragma HLS INTERFACE m_axi | tapa::mmap<T> port (by value) | Remove all m_axi pragmas |
| hls::stream<T>& | tapa::istream<T>& or tapa::ostream<T>& | Direction is explicit in TAPA |
| #pragma HLS dataflow + direct calls | tapa::task().invoke(...) | Tasks run concurrently |
| Top function contains computation | Move computation into child tasks | TAPA upper-level tasks are orchestration-only |
| hls::stream<T> local variable | tapa::stream<T> local variable | Same syntax; depth is enforced during software simulation (default depth: 2) |
Example 1: Basic VecAdd migration
The full before and after files are at example_1_before.cpp and example_1_after.cpp.
Step 1: Replace the include
-#include <hls_stream.h>
-#include <hls_vector.h>
+#include <hls_vector.h>
+#include <tapa.h>
TAPA provides its own stream types, so hls_stream.h is no longer needed. Other HLS headers such as ap_int.h and hls_vector.h are still supported and can be included as usual.
Step 2: Replace pointer arguments with tapa::mmap<T>
Vitis HLS uses raw pointers annotated with #pragma HLS INTERFACE m_axi to indicate off-chip memory. TAPA replaces this with tapa::mmap<T> passed by value, and no pragma is needed:
-void load_input(hls::vector<uint32_t, NUM_WORDS>* in,
+void load_input(tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in,
- hls::vector<uint32_t, NUM_WORDS>* in1,
- hls::vector<uint32_t, NUM_WORDS>* in2,
- hls::vector<uint32_t, NUM_WORDS>* out, int size) {
-#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
-#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
-#pragma HLS INTERFACE m_axi port = out bundle = gmem0
+ tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in1,
+ tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in2,
+ tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> out, int size) {
tapa::mmap<T> supports element-indexed reads and writes (mem[i]) just like a pointer, so the body of each task usually does not need to change.
Step 3: Replace hls::stream<T>& with directional TAPA streams
Vitis HLS hls::stream<T>& is direction-agnostic: the same type is used whether the stream is read or written. TAPA makes direction explicit:
-void compute_add(hls::stream<hls::vector<uint32_t, NUM_WORDS>>& in1_stream,
- hls::stream<hls::vector<uint32_t, NUM_WORDS>>& in2_stream,
- hls::stream<hls::vector<uint32_t, NUM_WORDS>>& out_stream,
+void compute_add(tapa::istream<hls::vector<uint32_t, NUM_WORDS>>& in1_stream,
+ tapa::istream<hls::vector<uint32_t, NUM_WORDS>>& in2_stream,
+ tapa::ostream<hls::vector<uint32_t, NUM_WORDS>>& out_stream,
Use tapa::istream<T>& for streams the task reads from, and tapa::ostream<T>& for streams the task writes to. The read() method and the << operator work the same as in Vitis HLS.
Step 4: Replace local hls::stream<T> declarations
Local streams declared inside the top-level function become tapa::stream<T>:
- hls::stream<hls::vector<uint32_t, NUM_WORDS>> in1_stream("input_stream_1");
- hls::stream<hls::vector<uint32_t, NUM_WORDS>> in2_stream("input_stream_2");
- hls::stream<hls::vector<uint32_t, NUM_WORDS>> out_stream("output_stream");
+ tapa::stream<hls::vector<uint32_t, NUM_WORDS>> in1_stream("input_stream_1");
+ tapa::stream<hls::vector<uint32_t, NUM_WORDS>> in2_stream("input_stream_2");
+ tapa::stream<hls::vector<uint32_t, NUM_WORDS>> out_stream("output_stream");
tapa::stream<T> accepts a name string for the same debugging purpose as hls::stream<T>. To set a custom depth, use tapa::stream<T, DEPTH>. For stream arrays, use tapa::streams<T, ARRAY_SIZE, DEPTH>.
The default stream depth in TAPA is 2, matching the Vitis HLS default. Unlike Vitis HLS, TAPA enforces the depth during software simulation, which helps catch backpressure bugs before synthesis.
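What "enforcing the depth" means can be shown with a toy bounded FIFO. The class below is not tapa::stream; it is a minimal plain C++ stand-in, with hypothetical names, that rejects writes once Depth elements are queued.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy bounded FIFO (NOT tapa::stream): a write to a full FIFO fails until a
// reader frees a slot, which is the behavior a depth-enforcing simulation
// exposes.
template <typename T, std::size_t Depth = 2>
class ToyFifo {
 public:
  bool try_write(const T& v) {
    if (q_.size() >= Depth) return false;  // full: producer must retry
    q_.push_back(v);
    return true;
  }

  bool try_read(T& v) {
    if (q_.empty()) return false;  // empty: consumer must retry
    v = q_.front();
    q_.pop_front();
    return true;
  }

 private:
  std::deque<T> q_;
};

// Returns how many of `attempts` back-to-back writes succeed with no reader.
template <std::size_t Depth>
std::size_t WritesAcceptedWithoutReader(std::size_t attempts) {
  ToyFifo<int, Depth> fifo;
  std::size_t accepted = 0;
  for (std::size_t i = 0; i < attempts; ++i) {
    if (fifo.try_write(static_cast<int>(i))) ++accepted;
  }
  return accepted;
}
```

With the default depth of 2, only the first two of five unread writes succeed, whereas a depth of 8 accepts all five. A producer that assumes unlimited buffering would therefore stall in a depth-enforcing simulation, which is how such bugs surface early.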
Step 5: Replace #pragma HLS dataflow with tapa::task().invoke(...)
Vitis HLS uses #pragma HLS dataflow to signal that a sequence of direct function calls should run as concurrent processes. TAPA replaces this with an explicit task graph:
-#pragma HLS dataflow
- load_input(in1, in1_stream, size);
- load_input(in2, in2_stream, size);
- compute_add(in1_stream, in2_stream, out_stream, size);
- store_result(out, out_stream, size);
+ tapa::task()
+ .invoke(load_input, in1, in1_stream, size)
+ .invoke(load_input, in2, in2_stream, size)
+ .invoke(compute_add, in1_stream, in2_stream, out_stream, size)
+ .invoke(store_result, out, out_stream, size);
All tasks in a tapa::task().invoke(...) chain run concurrently. The top-level function becomes pure orchestration — it declares streams, then hands everything off to child tasks.
Example 2: Dataflow-in-a-loop
The full before and after files are at example_2_before.cpp and example_2_after.cpp.
Vitis HLS permits #pragma HLS dataflow inside a for loop. Each iteration starts a new concurrent dataflow region:
// Vitis HLS: dataflow region restarts each iteration
size /= NUM_WORDS;
for (int i = 0; i < size; i++) {
#pragma HLS dataflow
load_input(in1, in1_stream, i);
load_input(in2, in2_stream, i);
compute_add(in1_stream, in2_stream, out_stream);
store_result(out, out_stream, i);
}
TAPA does not allow computation in upper-level tasks. A top-level TAPA task may only declare streams and invoke child tasks — it cannot contain loops or arithmetic. The solution is to move the loop into each child task:
// TAPA: loop lives in the child tasks
void load_input(tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in,
                tapa::ostream<hls::vector<uint32_t, NUM_WORDS>>& inStream,
                int size) {
  size /= NUM_WORDS;
  for (int i = 0; i < size; i++) {
#pragma HLS pipeline II = 1
    inStream << in[i];
  }
}
The top-level task then becomes:
void vadd(...) {
  tapa::stream<...> in1_stream(...);
  tapa::stream<...> in2_stream(...);
  tapa::stream<...> out_stream(...);
  tapa::task()
      .invoke(load_input, in1, in1_stream, size)
      .invoke(load_input, in2, in2_stream, size)
      .invoke(compute_add, in1_stream, in2_stream, out_stream, size)
      .invoke(store_result, out, out_stream, size);
}
The child tasks stream data to each other for the full duration; no synchronization is needed between iterations because each task has its own loop that runs from start to finish.
HLS-compat helpers for incremental migration
If you have a large existing codebase, TAPA provides tapa::hls::stream<T> as a drop-in replacement for hls::stream<T>. Unlike tapa::stream<T>, it uses effectively infinite depth in software simulation, so producers never block. This lets you keep direction-agnostic stream passing patterns while still running software simulation.
tapa::hls::stream<T> is available via #include <tapa.h> — no additional include is needed.
// Before (Vitis HLS):
hls::stream<float>& s
// After (TAPA compat, passes software simulation without depth tuning):
tapa::hls::stream<float>& s
Use this as a stepping stone: get software simulation passing with tapa::hls::stream, then replace with directional tapa::istream<T>& / tapa::ostream<T>& before shipping.
tapa::hls::stream synthesizes correctly — the generated RTL FIFO is identical to tapa::stream<T, N>. The reason to replace it before hardware build is that the infinite simulation depth hides backpressure bugs. Switching to directional streams with a tuned depth catches those bugs during software simulation, before they appear on hardware.
Next step: Lab 4: Custom RTL Modules
Lab 4: Custom RTL Modules
Goal: Replace a TAPA task with a hand-written RTL module while keeping a C++ behavior model for software simulation.
Prerequisites: Lab 1: Vector Add and familiarity with the TAPA compile pipeline.
After this lab you will understand how to write a C++ behavior model for an ignored task, label it for RTL replacement, generate RTL port templates, provide custom RTL, and repack into a deployable XO.
When to use this
Use custom RTL modules when:
- An existing RTL implementation is available from a vendor IP catalog or a prior design, and reimplementing it in HLS would be wasteful.
- A task requires timing, area, or interface characteristics that HLS cannot produce.
- A task is too complex to express in synthesizable C++ and a direct RTL description is more practical.
Overview
The workflow has three parts:
- Write a C++ behavior model that correctly implements the task. This is what runs during software simulation, and the code does not need to be synthesizable.
- Wrap the behavior model in a task annotated with [[tapa::target("ignore")]]. TAPA compiles the rest of the design normally and generates RTL port template files for the ignored task instead of synthesizing it.
- Provide the actual RTL implementation and repack the XO.
Example: using a vendor floating-point IP
Suppose you have a task that computes element-wise reciprocal square root and want to use Xilinx's Floating-Point IP core rather than the HLS-generated logic.
Step 1: Write the C++ behavior model
The behavior model lives in an ordinary task function. It will be called during software simulation and will never be synthesized, so it can use any C++ — standard library calls, dynamic containers, whatever is convenient and correct.
#include <cmath>
#include <cstdint>
#include <tapa.h>
// Behavior model: runs during software simulation only.
// Uses std::sqrt, which does not need to be synthesizable.
void RsqrtCore(tapa::istream<float>& in, tapa::ostream<float>& out,
               uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    float val = in.read();
    out.write(1.0f / std::sqrt(val));  // stdlib call: fine for simulation
  }
}
Step 2: Wrap with [[tapa::target("ignore")]]
Create a thin wrapper that invokes the behavior model. The [[tapa::target("ignore")]] attribute tells TAPA to skip synthesis of this wrapper and generate RTL port templates in its place. During software simulation the wrapper runs normally, which in turn calls RsqrtCore.
[[tapa::target("ignore")]] void Rsqrt(
    tapa::istream<float>& in, tapa::ostream<float>& out, uint64_t n) {
  tapa::task().invoke(RsqrtCore, in, out, n);
}
Only the wrapper needs the attribute. The behavior model (RsqrtCore) is a plain task function. Software simulation runs the wrapper as usual; synthesis skips it and generates port templates.
Step 3: Integrate into the top-level task
void Pipeline(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n) {
  tapa::stream<float> in_q("in");
  tapa::stream<float> out_q("out");
  tapa::task()
      .invoke(Load, in, n, in_q)
      .invoke(Rsqrt, in_q, out_q, n)  // custom RTL replaces this
      .invoke(Store, out_q, out, n);
}
Step 4: Compile to generate template files
tapa compile \
--top Pipeline \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-f pipeline.cpp \
-o work.out/pipeline.xo
Because Rsqrt is tagged ignore, TAPA generates RTL template files under work.out/template/. These templates define the exact port signatures the replacement RTL module must match.
Step 5: Implement the RTL
Write or adapt your RTL files so their port declarations match the generated templates. When you run tapa pack --custom-rtl in the next step, TAPA performs advisory port checking on .v files: it warns on mismatches but does not abort the build. Resolve any reported mismatches before moving to hardware.
Step 6: Repack with custom RTL
Two workflows are available depending on whether you are iterating on the RTL separately from the HLS compilation step.
Option A — Two-step workflow (compile once, iterate on RTL separately):
tapa pack \
-o work.out/pipeline.xo \
--custom-rtl ./rtl/
Option B — One-step workflow (compile and pack together):
tapa compile \
--top Pipeline \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-f pipeline.cpp \
-o work.out/pipeline.xo \
--custom-rtl ./rtl/
--custom-rtl accepts a file path or a directory. To include multiple paths, repeat the flag. .v files receive advisory port checking; other file types (for example .tcl) are packaged without format checking.
Software simulation with the behavior model
Because the behavior model is plain C++, software simulation works exactly as for any other TAPA design:
tapa g++ -- pipeline.cpp host.cpp -o pipeline
./pipeline
The behavior model does not need to match the RTL cycle-accurately — it only needs to produce the correct output values. Use this to validate host logic and data paths before RTL is ready.
The behavior model code can freely use unsynthesizable constructs: standard library functions, dynamic allocation, floating-point math, file I/O for golden output comparison, and so on. TAPA never attempts to synthesize it.
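Because the behavior model only has to produce correct values, host-side validation usually reduces to comparing the design's output against a golden reference within a tolerance. The following is a minimal sketch in plain standard C++ that does not use the TAPA API. The helper names GoldenRsqrt and CountMismatches and the default tolerance are assumptions made for illustration, not part of the example in the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: golden reference for element-wise reciprocal sqrt.
std::vector<float> GoldenRsqrt(const std::vector<float>& in) {
  std::vector<float> out;
  out.reserve(in.size());
  for (float v : in) {
    assert(v > 0.0f);  // this sketch only handles positive inputs
    out.push_back(1.0f / std::sqrt(v));
  }
  return out;
}

// Hypothetical helper: counts elements whose error exceeds the tolerance.
// The tolerance is relative for large values and absolute below 1.0.
std::size_t CountMismatches(const std::vector<float>& actual,
                            const std::vector<float>& expected,
                            float rel_tol = 1e-5f) {
  const std::size_t common = std::min(actual.size(), expected.size());
  // Elements present on only one side count as mismatches.
  std::size_t mismatches = std::max(actual.size(), expected.size()) - common;
  for (std::size_t i = 0; i < common; ++i) {
    const float scale = std::max(std::fabs(expected[i]), 1.0f);
    const float diff = std::fabs(actual[i] - expected[i]);
    // Written as a negated <= so that NaN results also count as mismatches.
    if (!(diff <= rel_tol * scale)) ++mismatches;
  }
  return mismatches;
}
```

A host program could call CountMismatches on the buffer read back from the design and report PASS only when the result is zero. The same check applies unchanged once the custom RTL replaces the behavior model in cosim.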
Validation
After repacking, run fast cosim to verify the custom RTL produces correct results before committing to a full bitstream build:
./pipeline --bitstream=work.out/pipeline.xo 1000
Catching functional bugs at cosim time is far cheaper than discovering them after hours of bitstream generation.
Full example
The complete working example is in tests/functional/custom-rtl in the TAPA repository.
Next step: Lab 5: Parallel RTL Emulation
Lab 5: Parallel RTL Emulation
Goal: Compile cycle-sensitive kernel modules to RTL and simulate them concurrently, reducing total cosim time while preserving cycle-accurate behavior where it matters.
Prerequisites: Lab 1: Vector Addition and Fast Hardware Simulation.
After this lab you will understand how to use tapa::executable to assign per-kernel RTL targets, compile each kernel to its own .xo, run the simulations in parallel, and prevent work-directory collisions between concurrent instances.
When to use this
RTL cosimulation gives you cycle-accurate behavior for the logic inside each kernel — pipeline depths, stall conditions, hazards, and II violations that software simulation cannot catch. However, not everything needs this level of fidelity:
- FIFOs between kernels are modeled as shared-memory queues, not cycle-accurate RTL. The latency across kernel boundaries is not representative of hardware.
- Memory accesses (mmap, async_mmap) are similarly abstracted; memory latency is not cycle-accurate.
Parallel RTL emulation is therefore most valuable for validating the cycle-sensitive internals of each kernel in isolation — compute pipelines, II, resource usage — rather than end-to-end timing across the full datapath.
Running one cosim process per kernel and launching them concurrently reduces wall-clock time compared to simulating everything in a single process or sequentially. Use it when:
- Your design contains multiple kernels with non-trivial compute pipelines that need cycle-accurate validation.
- You want to catch pipeline hazards, incorrect II, or RTL-level bugs in each kernel before the expensive bitstream link step.
- The kernels can be compiled and simulated independently.
Concept
In a standard single-kernel design, one top-level function compiles to one .xo and one cosim process validates it. In the parallel emulation pattern, several kernel functions compile independently and the host program runs one cosim process per kernel, all concurrently:
tapa::task()
.invoke(KernelA, tapa::executable(FLAGS_a_bitstream), ...) ──▶ cosim process A (cycle-accurate)
.invoke(KernelB, tapa::executable(FLAGS_b_bitstream), ...) ──▶ cosim process B (cycle-accurate)
.invoke(KernelC, tapa::executable(FLAGS_c_bitstream), ...) ──▶ cosim process C (cycle-accurate)
The streams connecting the processes are shared-memory FIFOs managed by the host runtime — latency-insensitive data transfer that lets each cosim process run at its own pace. Each kernel's internal cycle behavior is faithfully simulated; the inter-kernel communication is not.
Step 1: Write the kernels
Each kernel is a plain TAPA task function. The Cannon matrix-multiply example from tests/functional/parallel-emulation/ uses three kernel functions — Scatter, ProcElem, and Gather — all in one source file:
// Distribute matrix rows into per-PE stream arrays
void Scatter(tapa::mmap<const float> matrix,
tapa::ostreams<float, p * p>& block) { ... }
// Each PE computes its sub-matrix tile
void ProcElem(tapa::istream<float>& a_fifo, tapa::istream<float>& b_fifo,
tapa::ostream<float>& c_fifo, ...) { ... }
// Collect PE results into the output matrix
void Gather(tapa::mmap<float> matrix,
tapa::istreams<float, p * p>& block) { ... }
The top-level function declares the shared streams and assembles the task graph using tapa::executable:
DEFINE_string(scatter_bitstream, "", "XO for Scatter; empty = software simulation");
DEFINE_string(proc_elem_bitstream, "", "XO for ProcElem; empty = software simulation");
DEFINE_string(gather_bitstream, "", "XO for Gather; empty = software simulation");
void Cannon(tapa::mmap<const float> a_vec, tapa::mmap<const float> b_vec,
            tapa::mmap<float> c_vec, uint64_t n) {
  tapa::streams<float, p * p> a("a"), b("b"), c("c");
  // ... inter-PE streams ...
  tapa::task()
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), a_vec, a)
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), b_vec, b)
      .invoke(ProcElem, tapa::executable(FLAGS_proc_elem_bitstream), a, b, c, ...)
      // ... more ProcElem instances ...
      .invoke(Gather, tapa::executable(FLAGS_gather_bitstream), c_vec, c);
}
When a FLAGS_*_bitstream flag is empty, that invocation falls back to software simulation automatically. This lets you bring up one kernel at a time.
Step 2: Compile each kernel separately
Each kernel function is compiled independently with its own tapa compile --top invocation:
tapa compile --top Scatter --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
-f cannon.cpp -o scatter.xo
tapa compile --top ProcElem --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
-f cannon.cpp -o proc-elem.xo
tapa compile --top Gather --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
-f cannon.cpp -o gather.xo
The three compilations read the same source file but each targets a different top function. The outputs are independent .xo files with no knowledge of each other.
Step 3: Run parallel emulation
Pass all three .xo files to the host binary. All cosim processes start concurrently:
./cannon-host \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo
Preventing work-directory collisions
By default each cosim process uses a temporary directory that is deleted at exit. When multiple processes share an explicit -cosim_work_dir, their intermediate files collide. Use -cosim_work_dir_parallel to give each process a unique subdirectory:
./cannon-host \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo \
-cosim_work_dir ./cosim_work \
-cosim_work_dir_parallel
TAPA creates ./cosim_work/XXXXXX/ per instance so the simulations do not interfere.
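The XXXXXX placeholder follows the naming convention of the POSIX mkdtemp() call, which fills in a unique suffix and creates the directory. Whether TAPA itself uses mkdtemp() is an assumption here; the sketch below only illustrates how per-instance subdirectories of this shape can be created without collisions. MakeUniqueWorkDir is a hypothetical helper, not a TAPA function.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>

// Hypothetical helper, not part of TAPA: creates a uniquely named
// subdirectory under the existing directory `parent` and returns its path,
// or an empty string on failure.
std::string MakeUniqueWorkDir(const std::string& parent) {
  assert(!parent.empty());
  std::string path = parent + "/XXXXXX";
  // mkdtemp() replaces the trailing XXXXXX in place with a unique suffix
  // and creates the directory with mode 0700.
  if (mkdtemp(&path[0]) == nullptr) return "";
  return path;
}
```

Two calls with the same parent return two different paths, so concurrent simulations that each obtain their own directory this way do not overwrite one another's intermediate files.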
Limiting concurrency
On memory-constrained machines, set TAPA_CONCURRENCY to cap the number of running cosim processes:
TAPA_CONCURRENCY=1 ./cannon-host \
--scatter_bitstream=scatter.xo \
--proc_elem_bitstream=proc-elem.xo \
--gather_bitstream=gather.xo
Even with TAPA_CONCURRENCY=1 the processes exchange data correctly through shared-memory FIFOs; they just run one at a time.
Step 4: Verify
A successful run prints the application's correctness result (e.g., PASS!) after all simulation processes finish. Diagnose failures the same way as single-kernel cosim: add -cosim_work_dir and -xsim_save_waveform to inspect per-kernel waveforms.
Further reading
Parallel RTL Emulation in the How-To Guides covers the full API reference, runtime flags, and additional invocation patterns.
Next step: Lab 6: Floorplan & DSE
Lab 6: Floorplan & DSE
Goal: Use TAPA's floorplan design space exploration (DSE) to achieve timing closure on multi-SLR FPGAs.
Prerequisites: Lab 2: High-Bandwidth Memory and familiarity with synthesis flags from Performance Tuning.
After this lab you will understand how to apply a floorplan solution to a compile step and, if the RapidStream optimization tool is available, how to generate floorplan solutions automatically.
Overview
Multi-SLR FPGAs (U250, U280, U55C, and similar) partition logic across physically separate silicon dies connected by SLR crossings. Long wires that cross SLR boundaries are a common source of timing failures. TAPA's floorplan tooling addresses this by:
- Assigning tasks to specific SLR regions.
- Automatically inserting pipeline registers on streams that cross SLR boundaries.
- Running a design space exploration to find placement configurations that stay within per-SLR resource limits.
Tool dependency
The floorplan generation step — which searches for optimal task-to-SLR assignments — requires rapidstream-tapaopt, an optimization tool historically provided by RapidStream Design Automation. This tool is no longer publicly accessible. If you hold a license, the full two-workflow process described below applies. If you do not, you can still apply a hand-written or externally provided floorplan.json directly using Workflow A Step 2, skipping the generation step.
Compiling a design with a floorplan applied — inserting pipeline registers and reorganizing the task hierarchy — works without rapidstream-tapaopt. Only the automated search for floorplan solutions requires the external tool.
Workflow A: Manual floorplan
Use this workflow when you want to inspect individual floorplan solutions before committing to a full compile, or when you already have a floorplan.json from another source.
Step 1: Generate floorplan solutions (requires rapidstream-tapaopt)
tapa generate-floorplan \
-f kernel.cpp \
-t kernel0 \
--device-config device_config.json \
--floorplan-config floorplan_config.json \
--clock-period 3.00 \
--part-num xcu55c-fsvh2892-2L-e
This runs the DSE and writes one or more floorplan_N.json files to the working directory. Each file represents a distinct placement solution.
Step 2: Compile with a chosen solution
tapa compile \
-f kernel.cpp \
-t kernel0 \
--floorplan-path floorplan_0.json \
--clock-period 3.00 \
--part-num xcu55c-fsvh2892-2L-e \
--flatten-hierarchy
--floorplan-path requires --flatten-hierarchy. Omitting --flatten-hierarchy will cause the compile to fail.
TAPA reorganizes the task hierarchy according to the chosen floorplan and inserts pipeline registers at all SLR-crossing streams. This step does not require rapidstream-tapaopt.
Workflow B: Automated DSE (requires rapidstream-tapaopt)
Use this workflow to generate and compile all floorplan solutions in one step without manual inspection between them.
tapa compile-with-floorplan-dse \
-f kernel.cpp \
-t kernel0 \
--device-config device_config.json \
--floorplan-config floorplan_config.json \
--clock-period 3.00 \
--part-num xcu55c-fsvh2892-2L-e
compile-with-floorplan-dse runs the DSE, then compiles and applies pipeline insertion for each floorplan solution it generates. Use this when you want to produce all candidates in one automated run and pick the best result based on downstream timing reports.
Floorplan config format
The --floorplan-config JSON controls how the DSE searches for placement solutions. A representative example:
{
"max_seconds": 1000,
"dse_range_min": 0.7,
"dse_range_max": 0.88,
"partition_strategy": "flat",
"cpp_arg_pre_assignments": {
"a": "SLOT_X1Y0:SLOT_X1Y0",
"b_0": "SLOT_X2Y0:SLOT_X2Y0"
},
"sys_port_pre_assignments": {
"ap_clk": "SLOT_X2Y0:SLOT_X2Y0"
}
}
Key fields:
- dse_range_min / dse_range_max: The acceptable per-SLR resource utilization range (as a fraction of 1.0). The DSE only keeps placements where every SLR falls within this band.
- cpp_arg_pre_assignments: Forces specific top-function kernel arguments to specific SLR slots. Values are SLOT_XmYn:SLOT_XmYn strings. Array arguments can be matched with regex patterns (for example "c_.*" matches c_0, c_1, etc.).
- sys_port_pre_assignments: Forces Verilog system ports (clock, reset, AXI control) to specific slots. Regex patterns are supported here as well.
The full set of available fields (including grouping_constraints, slot_to_rtype_to_min_limit, and others) is documented in the RapidStream floorplan configuration reference.
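The two mechanisms described under Key fields, the utilization band and regex-based pre-assignment, can be sketched in plain standard C++. This is an illustration of the semantics as described on this page, not the tool's implementation. The function names are hypothetical, and whole-name matching (std::regex_match rather than a substring search) is an assumption.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// True when every per-SLR utilization lies inside [range_min, range_max].
bool WithinDseRange(const std::vector<double>& slr_utilization,
                    double range_min, double range_max) {
  assert(range_min <= range_max);
  for (double u : slr_utilization) {
    if (u < range_min || u > range_max) return false;
  }
  return true;
}

// Returns the argument names selected by a pre-assignment pattern.
// Assumption: the pattern must match the whole argument name.
std::vector<std::string> MatchArgs(const std::string& pattern,
                                   const std::vector<std::string>& arg_names) {
  const std::regex re(pattern);
  std::vector<std::string> matched;
  for (const std::string& name : arg_names) {
    if (std::regex_match(name, re)) matched.push_back(name);
  }
  return matched;
}
```

With the example config above, a placement with per-SLR utilizations of 0.72, 0.80, and 0.88 passes the band check, while one with an SLR at 0.95 does not. The pattern "c_.*" selects c_0 and c_1 but leaves a and b_0 alone.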
Further reading
Performance Tuning covers the --gen-ab-graph and --gen-graphir flags, which produce visual and structural representations of the task graph useful for diagnosing floorplan decisions.
Next step: Examples Catalog
Examples Catalog
The TAPA repository includes two sets of example designs. Small self-contained examples live under tests/apps/. Larger benchmarks live under tests/regression/.
Small examples
| Example | Problem type | Key TAPA feature | Location |
|---|---|---|---|
| vadd | Vector addition | Basic streams + mmap | tests/apps/vadd |
| bandwidth | Memory bandwidth benchmark | async_mmap, 32 HBM channels | tests/apps/bandwidth |
| network | Packet switching | peek, detached tasks, hierarchical tasks | tests/apps/network |
| cannon | Cannon's matrix multiply | 2D stream arrays, systolic | tests/apps/cannon |
| jacobi | Stencil computation | End-of-transmission (close()) | tests/apps/jacobi |
Published benchmarks
| Example | Problem type | Key feature | Published in |
|---|---|---|---|
| autosa mm/10x13 | Matrix multiplication | AutoSA-generated systolic (90% U55C LUT) | — |
| callipepla | Conjugate gradient | 26 HBM channels | FPGA'23 |
| cnn | CNN systolic array | Multi-SLR | FPGA'21 |
| lu_decompose | LU systolic array | Multi-SLR | FPGA'21 |
| hbm-bandwidth | HBM bandwidth profiler | async_mmap, all 32 channels | — |
| hbm-bandwidth-1-ch | HBM bandwidth (1 channel) | Minimal async_mmap | — |
| serpens | Sparse SpMV | Multiple HBM channels, scalable parallelism | DAC'22 |
| spmm | Sparse SpMM | HBM streams | FPGA'22 |
| spmv-hisparse-mmap | Sparse SpMV (HiSparse) | mmap-based SpMV | FPGA'22 |
| knn | K-nearest-neighbor | FPT accelerator | FPT'20 |
| page_rank | Page Rank | FCCM accelerator | FCCM'21 |
The tests/regression/ directory is under active development; new designs are added regularly. Check the repository for the latest list.
Next step: Common Errors
Common Errors
Symptom descriptions and fixes for the most common compile-time and runtime errors.
When to use this page: When tapa g++ or tapa compile reports an error, or when software simulation crashes or produces wrong output.
Stream passed by value
Symptom: Compile error mentioning a deleted copy constructor, or that istream/ostream is not CopyConstructible.
Cause: The stream parameter is declared without &. Streams are non-copyable objects — they represent live communication channels between tasks, not data values.
Fix: Always pass streams by reference.
// Wrong
void Task(tapa::istream<int> in, tapa::ostream<int> out) { ... }
// Right
void Task(tapa::istream<int>& in, tapa::ostream<int>& out) { ... }
mmap passed by reference
Symptom: Compile error about a type mismatch or an unexpected & on an mmap parameter.
Cause: tapa::mmap<T> is essentially a pointer to a memory region and must be passed by value, not by reference.
Fix: Remove the & from mmap parameters.
// Wrong
void Task(tapa::mmap<int>& mem) { ... }
// Right
void Task(tapa::mmap<int> mem) { ... }
async_mmap passed by value
Symptom: Passing async_mmap by value is deprecated and may produce a warning or error depending on the TAPA version.
Cause: tapa::async_mmap<T> is a set of streams that controls memory access. Like regular streams, it must be passed by reference.
Fix: Always pass async_mmap by reference.
// Wrong
void Task(tapa::async_mmap<int> mem) { ... }
// Right
void Task(tapa::async_mmap<int>& mem) { ... }
Computation in upper-level task body
Symptom: tapacc reports an error about computation in an upper-level task, or the design fails synthesis unexpectedly.
Cause: Upper-level tasks (tasks that invoke other tasks) may only contain stream declarations and .invoke() chains. Any arithmetic, conditionals, or other function calls belong in leaf tasks. For example, computing n * 2 directly in TopLevel is not allowed:
// Wrong
void TopLevel(int n, tapa::mmap<int> mem) {
  tapa::stream<int> s("s");
  tapa::task()
      .invoke(Task1, s, mem, n * 2)
      .invoke(Task2, s, n * 2);
}
Fix: Move the computation into the child task that uses the result.
// Right
void Task2(tapa::istream<int>& in, int n) {
  n = n * 2;
  // use n ...
}
void TopLevel(int n, tapa::mmap<int> mem) {
  tapa::stream<int> s("s");
  tapa::task()
      .invoke(Task1, s, mem, n)  // Task1 doubles n internally, like Task2
      .invoke(Task2, s, n);
}
Stream array declared as stream[] instead of streams<>
Symptom: Compile error or incorrect behavior when defining or passing arrays of streams.
Cause: tapa::stream<T> arr[N] is not copyable or movable in the way TAPA expects. Arrays of streams must use the dedicated tapa::streams<T, N> type.
Fix: Use tapa::streams<T, N> for stream arrays, and use .invoke with a count to distribute elements rather than indexing manually.
// Wrong
tapa::stream<int> data_q[4];
tapa::task().invoke(Task, data_q[0], mem[0])
.invoke(Task, data_q[1], mem[1]);
// Right
tapa::streams<int, 4> data_q;
tapa::mmaps<int, 4> mem;
tapa::task().invoke<tapa::join, 4>(Task, data_q, mem);
tapac not found
Symptom: Shell reports command not found: tapac.
Cause: tapac was the old command name. It has been replaced by tapa compile.
Fix: Replace tapac with tapa compile. Most flags carry over directly.
# Old
tapac --top VecAdd -f vadd.cpp -o vadd.xo ...
# New
tapa compile --top VecAdd -f vadd.cpp -o vadd.xo ...
Run tapa compile --help for the full option list.
Tasks not defined in the same compilation unit as the top-level function
Symptom: tapacc cannot find a task function, or a link error occurs for a task symbol.
Cause: TAPA requires all task functions to be visible in the same compilation unit as the top-level function. Placing tasks in separate .cpp files means the compiler never sees them together.
Fix: Define tasks in header files and #include them in the main kernel file.
// task1.hpp
void Task1(/* ... */) { /* ... */ }
// task2.hpp
void Task2(/* ... */) { /* ... */ }
// top_level.cpp
#include "task1.hpp"
#include "task2.hpp"
void TopLevel(/* ... */) {
tapa::task().invoke(Task1, /* ... */).invoke(Task2, /* ... */);
}
Static variables behave differently in simulation vs hardware
Symptom: Software simulation produces different output than hardware execution.
Cause: Static variables are shared across all invocations within a single simulation process. In hardware, each task instance synthesizes its own independent copy of the variable.
For example:
void Task() {
static int counter = 0;
counter++;
}
tapa::task().invoke(Task).invoke(Task);
In software simulation counter reaches 2 (one shared variable, incremented twice). In hardware each instance has its own counter, so both instances end at 1.
Fix: Avoid static variables inside tasks. Pass state between tasks using stream or mmap arguments.
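The divergence can be reproduced in plain standard C++ without TAPA. In the sketch below, SharedCounterTask mirrors the task above, while TaskInstance is a hypothetical stand-in for the hardware behavior in which every instance owns its state. The helper names are assumptions for illustration only.

```cpp
#include <cassert>

// Software-simulation view: every invocation shares one static counter.
int SharedCounterTask() {
  static int counter = 0;
  return ++counter;
}

// Hardware view (modeled): each task instance carries its own counter.
struct TaskInstance {
  int counter = 0;
  int Run() { return ++counter; }
};

// Invokes the shared-static task `instances` times and returns the value
// seen by the last invocation.
int RunShared(int instances) {
  assert(instances > 0);
  int last = 0;
  for (int i = 0; i < instances; ++i) last = SharedCounterTask();
  return last;
}

// Creates `instances` independent instances, runs each once, and returns
// the value seen by the last one.
int RunPerInstance(int instances) {
  assert(instances > 0);
  int last = 0;
  for (int i = 0; i < instances; ++i) {
    TaskInstance t;
    last = t.Run();
  }
  return last;
}
```

On a fresh process, RunShared(2) yields 2 and RunPerInstance(2) yields 1, the same split described above between software simulation and hardware.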
If a parameter type mismatch error is confusing, work through this checklist:
- Does the number of arguments at the call site match the task signature?
- Are stream directions correct: istream for reads, ostream for writes?
- Are passing conventions correct: streams and async_mmap by reference, mmap by value?
- Is the parameter order the same between the call site and the task definition?
See also: Deadlocks & Hangs | Cosimulation Issues
Deadlocks & Hangs
When to use this page: When software simulation or fast cosim hangs without producing output, or terminates without printing results.
tapa::stream enforces the declared depth in both software simulation and fast cosim/RTL. A blocking write() on a full stream yields the current coroutine and retries until space is available — so shallow stream depth can deadlock in software simulation too. The exception is tapa::hls::stream (the Vitis HLS compatibility alias), which uses effectively infinite depth in software simulation.
Diagnosis checklist
Work through the following causes in order — they are listed from most to least common.
1. Stream depth too shallow
A producer fills the FIFO and blocks waiting for the consumer to drain it. If the consumer is itself waiting for data from another stream, neither task can make progress and the simulation hangs.
Fix: Increase the stream depth by providing the second template argument.
// Default depth of 2 — may deadlock under backpressure
tapa::stream<int> s("s");
// Larger depth gives the producer room to run ahead
tapa::stream<int, 32> s("s");
Start at the default depth of 2 and increase to 16 or 32 when you observe backpressure. In hardware, deeper FIFOs consume more BRAM, so avoid over-provisioning depth once correctness is confirmed.
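To see why depth matters, the following plain C++ sketch models two bounded FIFOs and a cooperative scheduler. It does not use the TAPA API. The function name, the access pattern (the producer writes n elements to stream a and then n elements to stream b, while the consumer reads b before a), and the scheduling loop are all assumptions chosen to reproduce the situation described above.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Returns true if producer and consumer both finish, false if a round
// passes in which neither can make progress (a deadlock).
bool CompletesWithoutDeadlock(int n, std::size_t depth) {
  assert(n > 0 && depth > 0);
  std::queue<int> a, b;
  int produced = 0;        // 0..2n: the first n go to `a`, the next n to `b`
  int consumed = 0;        // completed (b, a) pairs
  bool holding_b = false;  // consumer has read from `b`, now needs `a`
  while (produced < 2 * n || consumed < n) {
    bool progress = false;
    // Producer step: a write succeeds only if the target FIFO has space.
    if (produced < 2 * n) {
      std::queue<int>& target = (produced < n) ? a : b;
      if (target.size() < depth) {
        target.push(produced);
        ++produced;
        progress = true;
      }
    }
    // Consumer step: read one element from `b`, then one from `a`.
    if (consumed < n) {
      if (!holding_b) {
        if (!b.empty()) {
          b.pop();
          holding_b = true;
          progress = true;
        }
      } else if (!a.empty()) {
        a.pop();
        holding_b = false;
        ++consumed;
        progress = true;
      }
    }
    if (!progress) return false;  // both tasks are blocked
  }
  return true;
}
```

With n = 8, a depth of 2 deadlocks because the producer blocks on a full a while the consumer waits on an empty b. A depth of 8 or more lets the producer finish writing a and move on to b, so the run completes.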
2. Missing loop termination or element count mismatch
A writer sends fewer elements than the reader expects. The reader blocks indefinitely waiting for data that never arrives.
Fix: Verify that every producer sends exactly as many elements as the corresponding consumer reads. A common mistake is an off-by-one in loop bounds or a conditional write that skips elements.
3. Circular dependency between tasks
Task A waits for output from Task B before it can write to Task B's input. Task B waits for input from Task A before it can produce output. Neither can make progress.
Fix: Redesign the data flow to eliminate the cycle. If a feedback path is genuinely required, use try_read / try_write so that a task can make progress even when the channel is empty or full.
4. async_mmap write responses not drained
The write_resp FIFO fills up. Once full, the hardware stops accepting new write addresses and the kernel stalls.
Fix: Always drain write_resp inside the same pipelined loop that issues writes. Use non-blocking try_write / try_read so both issue and drain progress every cycle:
void WriteTask(tapa::async_mmap<int>& mem, tapa::istream<int>& data, int n) {
  for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
    if (i_req < n && !data.empty() &&
        !mem.write_addr.full() && !mem.write_data.full()) {
      mem.write_addr.try_write(i_req);
      mem.write_data.try_write(data.read(nullptr));
      ++i_req;
    }
    uint8_t ack;
    if (mem.write_resp.try_read(ack)) {
      i_resp += unsigned(ack) + 1;  // ack encodes burst_length - 1
    }
  }
}
Splitting writes and response drain into separate loops risks deadlock: if write_resp fills before all writes are issued, the hardware stops accepting write addresses and the first loop never completes.
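The response accounting in the loop above relies on each write_resp value encoding burst_length - 1. The arithmetic can be checked in isolation with a plain C++ sketch. The helper names are hypothetical and the code does not touch the TAPA API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each response value encodes (burst_length - 1), so a single response
// acknowledges between 1 and 256 writes.
std::int64_t AcknowledgedWrites(const std::vector<std::uint8_t>& acks) {
  std::int64_t total = 0;
  for (std::uint8_t ack : acks) total += static_cast<std::int64_t>(ack) + 1;
  return total;
}

// True once the responses seen so far cover all `n` issued writes, which
// is the exit condition of the loop (i_resp >= n).
bool AllWritesAcknowledged(const std::vector<std::uint8_t>& acks,
                           std::int64_t n) {
  assert(n >= 0);
  return AcknowledgedWrites(acks) >= n;
}
```

For example, the responses 0, 0, and 3 acknowledge 1 + 1 + 4 = 6 writes, so a loop that issued six writes may exit after reading those three responses.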
Isolation strategy
Run with TAPA_CONCURRENCY=1 to serialize all tasks into a single coroutine thread. This makes a hang deterministic and easier to reproduce and attach a debugger to.
TAPA_CONCURRENCY=1 ./vadd
If the hang disappears at concurrency 1 but reappears at the default concurrency, the issue is a scheduling race rather than a structural deadlock. Look for assumptions about task ordering that do not hold under concurrent scheduling.
Finding the blocked task
Run the program under GDB to identify which task is stuck and on which operation.
gdb ./vadd
Start the program with run, let it execute until it hangs, then interrupt it with Ctrl-C:
^C
(gdb) info threads
(gdb) thread apply all bt
The backtrace will show the call stack for every coroutine. Look for a frame inside a read or write call on a TAPA stream — the stream name in that frame identifies where flow has stopped.
Waveform debugging in fast cosim
Run cosim with a persistent work directory and waveform capture enabled so you can inspect the simulation state after a hang.
./vadd --bitstream=vadd.xo \
-cosim_work_dir ./cosim_work \
-xsim_save_waveform \
1000
If the simulation hangs, press Ctrl-C to terminate it, then open the waveform in Vivado:
vivado -mode gui -source ./cosim_work/output/run/run_cosim.tcl
Inspect the AXI and stream signals to identify which channel is stalled. A valid signal held high with a ready signal held low indicates backpressure; a ready signal high with no valid indicates the producer has stopped sending.
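The valid/ready reading rules can be written down as a small classifier over sampled signal values. This is a plain C++ sketch for reasoning about an exported trace. The type and function names are hypothetical, and it assumes one valid sample and one ready sample per clock cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Handshake { kTransfer, kBackpressure, kStarved, kIdle };

// Classifies one cycle of a valid/ready handshake.
Handshake Classify(bool valid, bool ready) {
  if (valid && ready) return Handshake::kTransfer;
  if (valid) return Handshake::kBackpressure;  // producer waits on consumer
  if (ready) return Handshake::kStarved;       // consumer waits on producer
  return Handshake::kIdle;
}

// Length of the longest run of consecutive backpressure cycles in a trace.
std::size_t LongestBackpressureRun(const std::vector<bool>& valid,
                                   const std::vector<bool>& ready) {
  assert(valid.size() == ready.size());
  std::size_t longest = 0;
  std::size_t current = 0;
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (Classify(valid[i], ready[i]) == Handshake::kBackpressure) {
      ++current;
      if (current > longest) longest = current;
    } else {
      current = 0;
    }
  }
  return longest;
}
```

A channel whose backpressure run extends to the end of the trace is a good candidate for the stalled stream; a channel that stays in the starved state points upstream to its producer.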
Stream logs offer another way to locate the stall. Set TAPA_STREAM_LOG_DIR=/tmp/stream_logs before running, and TAPA logs each value written to a stream into a file under that directory:
TAPA_STREAM_LOG_DIR=/tmp/stream_logs ./vadd
Each named stream gets its own log file. A stream with an empty or truncated log identifies where data flow stops.
Stream depth tuning reference
| Symptom | Starting depth | Suggested increase |
|---|---|---|
| Hang with 2 tasks in a pipeline | 2 (default) | 16 |
| Hang with deep pipeline (>4 stages) | 16 | 32–64 |
| Correctness issue, no hang | Any | Try 2 first to expose races |
Increasing depth lets producers run further ahead of consumers and resolves backpressure-induced deadlocks. In hardware, each entry in a stream FIFO consumes flip-flops or BRAM. Once the design is functionally correct, profile resource usage and reduce depths where headroom allows.
See also: Common Errors | Cosimulation Issues
Cosimulation Issues
When to use this page: When --bitstream=vadd.xo (fast cosim) runs differently from software simulation, or when cosim produces xsim or Verilator errors.
Fast cosim vs software simulation mismatches
If fast cosim fails (FAIL! or hangs) but software simulation passes, the most common causes are:
- Non-deterministic task scheduling. Software simulation runs tasks cooperatively as coroutines; RTL runs them truly in parallel. Races hidden by cooperative scheduling may surface as failures in fast cosim, and results that depend on the relative timing of two tasks may differ between the two. Fix: remove any assumptions about task ordering that are not enforced by stream synchronization.
- Blocking async_mmap operations inside pipelined loops. A blocking call inside a pipelined loop can stall the pipeline in RTL in ways that software simulation does not model. Fix: use non-blocking reads/writes and manually handle the response FIFOs, or switch to tapa::mmap to simplify the memory access model while debugging.
Fast cosim models DRAM with a simplified functional model. Throughput and latency numbers from fast cosim are not representative of on-board performance. Use fast cosim only to verify functional correctness.
HBM cross-channel access limitation
Fast cosimulation does not support cross-channel access for HBM. Each AXI interface can only access one HBM channel. Designs that require cross-channel HBM access must be validated on hardware rather than in fast cosim.
If your design uses multiple HBM pseudo-channels and the fast cosim result does not match software simulation, verify that no single AXI port accesses more than one HBM channel.
xsim issues
xsim not found or Vivado not found
xsim is part of the Vivado installation. Source the Vivado environment script before running cosim:
source /opt/Xilinx/Vivado/2022.1/settings64.sh
./vadd --bitstream=vadd.xo ...
Adjust the path to match your Vivado installation and version.
xsim hangs at elaboration
Check that the .xo file was produced by a successful tapa compile run. A partial or corrupt .xo (from a failed or interrupted compilation) can cause elaboration to hang silently. Re-run tapa compile from scratch and verify it exits with status 0 before running cosim.
Segfault inside xsim
This is typically a Vivado bug. Try switching to a different Vitis/Vivado version. Versions tested by the TAPA CI pipeline are listed in the TAPA repository's CI configuration.
Verilator issues
verilator not found
Install Verilator from your package manager or build from source:
# Debian/Ubuntu
sudo apt install verilator
Verilator compilation error (Verilog parsing error)
TAPA generates Verilog targeting recent Verilator versions. If you see Verilog parsing errors, update Verilator to the version used in TAPA's CI pipeline.
No waveform support with Verilator
Verilator simulation does not support waveform capture via the Vivado GUI. If you need waveform debugging, use xsim and pass -xsim_save_waveform as described below.
Cosim produces wrong output (FAIL!) but xsim does not hang
Run with waveform capture and a persistent work directory so you can inspect the simulation after it completes:
./vadd --bitstream=vadd.xo \
-cosim_work_dir ./cosim_work \
-xsim_save_waveform \
1000
Then open the waveform in Vivado GUI:
vivado -mode gui -source ./cosim_work/output/run/run_cosim.tcl
In the waveform viewer, add the AXI memory interface signals and compare the expected vs actual data on each transaction. Look for read data that does not match what the host wrote, or write transactions that target unexpected addresses.
Stream diagnostics
The DPI runtime reports stream progress periodically when a stream stalls (empty on read or full on write). These messages appear on stderr and include the port name and queue state:
frt-dpi: progress[a_fifo_s]: read_ok=16 read_empty=40M write_ok=0 write_full=0 q_head=8 q_tail=8
| Field | Meaning |
|---|---|
progress[port] | The port that triggered the report (the one currently stalling). |
read_ok | Total successful reads across all ports in this process. |
read_empty | Total empty-read attempts (queue had no data). |
write_ok | Total successful writes across all ports. |
write_full | Total full-write attempts (queue had no space). |
q_head / q_tail | Shared-memory queue counters for the stalling port. q_tail = elements pushed by the producer; q_head = elements popped by the consumer. q_head == q_tail means the queue is empty. |
Enabling verbose per-element logging
Set the FRT_STREAM_DEBUG environment variable to log every successful stream read and write:
FRT_STREAM_DEBUG=1 ./vadd --bitstream=vadd.xo 1000
Interpreting stall patterns
- q_tail=0 on a consumer port: the producer never wrote to this stream. Check that the producer's xsim started and that stream arguments are bound correctly.
- q_head == q_tail but read_ok < expected: all produced elements were consumed but not enough were produced. The producer may have exited before flushing all writes.
- write_full growing: the consumer is not draining fast enough. Check for deadlocks or increase TAPA_CONCURRENCY.
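As a rough illustration of reading these reports programmatically, the sketch below splits one progress line into its fields and applies the first two stall rules. It assumes the line layout matches the sample shown earlier; the names Progress, ParseProgress, and Diagnose are hypothetical helpers, not part of TAPA or FRT. Counter values are kept as strings because they may carry suffixes such as 40M.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parsed view of one "frt-dpi: progress[...]" line (hypothetical helper type).
struct Progress {
  std::string port;                           // name inside progress[...]
  std::map<std::string, std::string> fields;  // raw key=value pairs
};

// Splits a progress line into the port name and its key=value fields.
inline Progress ParseProgress(const std::string& line) {
  static const std::string kTag = "progress[";
  const std::size_t tag = line.find(kTag);
  assert(tag != std::string::npos && "not a progress line");
  const std::size_t name_start = tag + kTag.size();
  const std::size_t name_end = line.find(']', name_start);
  assert(name_end != std::string::npos && "unterminated port name");

  Progress p;
  p.port = line.substr(name_start, name_end - name_start);
  std::istringstream rest(line.substr(name_end + 1));
  std::string token;
  while (rest >> token) {
    const std::size_t eq = token.find('=');
    if (eq == std::string::npos) continue;  // skips the ":" after the port
    p.fields[token.substr(0, eq)] = token.substr(eq + 1);
  }
  return p;
}

// Applies the queue-counter rules: q_tail == 0 means nothing was ever
// pushed; q_head == q_tail means everything pushed has been popped.
inline std::string Diagnose(const Progress& p) {
  const std::uint64_t head = std::stoull(p.fields.at("q_head"));
  const std::uint64_t tail = std::stoull(p.fields.at("q_tail"));
  if (tail == 0) return "producer never wrote";
  if (head == tail) return "queue empty: everything produced was consumed";
  return "queue holds unread elements";
}
```

Detecting a growing write_full counter needs two reports for the same port, so it is left out of this single-line sketch.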
Always pass software simulation before running fast cosim. Software simulation runs faster and catches logic bugs in C++. Fast cosim catches RTL bugs introduced by synthesis and scheduling. Skipping software simulation wastes cosim time on bugs that are much faster to fix at the C++ level.
See also: Common Errors | Deadlocks & Hangs
CLI Commands
Reference for all tapa CLI subcommands. For task-oriented guides, see Build and Run and the other How-To pages. The general invocation form is:
tapa [global options] <subcommand> [subcommand options]
tapa compile is a shortcut that runs tapa analyze, tapa synth, and tapa pack in sequence in a single command. When using the individual subcommands, pass --work-dir as a global flag before the subcommand name: tapa --work-dir DIR <subcommand>.
Global Options
These options must appear before the subcommand name.
| Flag | Description |
|---|---|
--work-dir DIR / -w DIR | Working directory for intermediate artifacts (default: ./work.out/). |
--verbose / -v | Increase logging verbosity. Repeatable (e.g., -vv). |
--quiet / -q | Decrease logging verbosity. |
--remote-host user@host[:port] | Remote Linux host where vendor tools run. |
--remote-key-file PATH | SSH private key file for authenticating to the remote host. |
--remote-xilinx-settings PATH | Path to settings64.sh on the remote host. |
--remote-ssh-control-dir DIR | Local directory for SSH multiplex control sockets. |
--remote-ssh-control-persist DURATION | How long the SSH master socket stays alive (default: 30m). |
--remote-disable-ssh-mux | Disable SSH connection multiplexing. |
tapa compile
Run the full compilation pipeline (analyze → synth → pack) in a single command.
Required flags
| Flag | Description |
|---|---|
--top FUNCTION / -t FUNCTION | Top-level task function name. |
-f FILE | Kernel source file. |
-o OUTPUT.xo | Output XO file path. |
Optional flags
| Flag | Description |
|---|---|
--part-num PART | Target FPGA part number (e.g., xcu250-figd2104-2L-e). |
--platform PLATFORM | Vitis platform string. Alternative to --part-num. |
--clock-period NS | Target clock period in nanoseconds. |
--target {xilinx-vitis,xilinx-hls,xilinx-aie} | Output target (default: xilinx-vitis). xilinx-aie is experimental. |
-j N | Number of parallel HLS jobs. |
--custom-rtl PATH | Custom RTL file or directory to include in the XO. |
Example
tapa compile \
--top VecAdd \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-f vadd.cpp \
-o vadd.xo
tapa analyze
Parse C++ source and extract the task graph to a JSON file in the work directory. This stage always runs locally and does not require vendor tools.
Required flags
| Flag | Description |
|---|---|
--top FUNCTION / -t FUNCTION | Top-level task function name. |
-f FILE | Kernel source file. |
Optional flags
| Flag | Description |
|---|---|
--target {xilinx-vitis,xilinx-hls,xilinx-aie} | Output target (default: xilinx-vitis). Controls the synthesis flow. xilinx-aie is experimental. |
Example
tapa --work-dir work.out analyze --top VecAdd -f vadd.cpp
tapa synth
Run Vitis HLS on each task to produce per-task Verilog RTL. Reads the task graph produced by tapa analyze from the work directory. Can run on a remote host via --remote-host.
Required flags
| Flag | Description |
|---|---|
--part-num PART | Target FPGA part number. Required if --platform is not set. |
--platform PLATFORM | Vitis platform string. Required if --part-num is not set. |
Optional flags
| Flag | Description |
|---|---|
--clock-period NS | Target clock period in nanoseconds. Can be derived from --platform if not set explicitly. |
-j N | Number of parallel HLS jobs (default: number of physical CPU cores). |
--enable-synth-util | Run post-HLS RTL synthesis to produce per-task resource utilization estimates. |
--nonpipeline-fifos JSON | JSON specification of FIFOs for which pipeline registers should be suppressed. |
--gen-ab-graph | Generate ab_graph.json for AutoBridge/RapidStream floorplanning. |
--gen-graphir | Generate graphir.json for RapidStream. |
--floorplan-config PATH | Path to the floorplan configuration file. Used with --gen-ab-graph or --gen-graphir. |
--device-config PATH | Path to the device configuration file. Used with --gen-graphir. |
--floorplan-path PATH | Path to an existing floorplan file to apply during synthesis. Requires --flatten-hierarchy. |
Example
tapa --work-dir work.out synth \
--part-num xcu250-figd2104-2L-e \
--clock-period 3.33 \
-j 4
tapa pack
Package per-task RTL from the work directory into a single output artifact. For the default xilinx-vitis target this produces an XO file; for other targets a ZIP file is produced. Reads RTL produced by tapa synth.
Optional flags
| Flag | Description |
|---|---|
-o OUTPUT | Output file path (default: work.xo for the Vitis target, work.zip for other targets). |
--custom-rtl PATH | Custom RTL file or directory to include in the XO. |
Example
tapa --work-dir work.out pack -o vadd.xo
tapa g++
Compile TAPA host and kernel C++ for software simulation. This is a wrapper around g++ that automatically sets the required TAPA include paths and link flags. All arguments after -- are forwarded directly to g++.
Example
tapa g++ -- vadd.cpp vadd-host.cpp -o vadd
See Software Simulation for how to run the resulting executable.
tapa version
Print the installed TAPA version.
tapa version
Runtime Flags
This page covers environment variables and host executable flags that control TAPA behavior at runtime. These apply after compilation, during software simulation or fast hardware cosimulation.
Environment Variables
These variables are read by the host executable at startup.
| Variable | Default | Description |
|---|---|---|
TAPA_CONCURRENCY | Number of CPU cores | Number of parallel coroutine threads used by software simulation. Set to 1 for single-threaded, more reproducible simulation runs. Has no effect on HLS compilation parallelism (-j). |
TAPA_STREAM_LOG_DIR | (unset — logging disabled) | Directory for stream transfer logs. When set, TAPA writes one log file per named stream recording each value written to that stream. Useful for tracing data corruption during software simulation. |
FRT_STREAM_DEBUG | (unset) | When set, log every successful stream read and write in the DPI layer. Produces high-volume output; use only for targeted debugging. |
FRT_COSIM_YIELD | 1 (enabled) | When enabled, the DPI layer calls thread::yield_now() on empty reads or full writes. Disable with 0 to busy-wait instead. |
FRT_XSIM_LEGACY | 0 | Set to 1 to use the legacy xelab command-line format for older Vivado versions. |
FRT_XOCL_BDF | (unset) | PCIe Bus:Device:Function for XRT/OpenCL device selection. Equivalent to the -xocl_bdf gflag. |
Example: reproducible single-threaded simulation
TAPA_CONCURRENCY=1 ./vadd
Example: enable stream logging
TAPA_STREAM_LOG_DIR=/tmp/stream-logs ./vadd
See Software Simulation for more on stream logging and debugging.
Host Executable Flags (Fast Cosim)
When the host executable is invoked with --bitstream=vadd.xo, it runs fast hardware cosimulation instead of software simulation. The following flags control cosim behavior. They are passed directly on the host executable command line.
These flags are written with a single-dash prefix (e.g., -cosim_work_dir). The host executable parses them with gflags, which accepts both the single-dash and the double-dash form.
| Flag | Description |
|---|---|
-cosim_executable <path> | Deprecated. Fast cosim now runs in-process via libfrt; this flag is ignored. |
-xsim_part_num <part> | Target FPGA part number for simulation (e.g., xcu280-fsvh2892-2L-e). |
-cosim_work_dir <dir> | Persistent working directory for simulation artifacts. Without this flag, a temporary directory is used and deleted after the run. |
-xsim_save_waveform | Save simulation waveforms to a .wdb file in the work directory. Pair with -cosim_work_dir; without it, the temporary directory and all waveforms are deleted after the run. |
-xsim_start_gui | Open the Vivado GUI for interactive debugging during simulation. |
-cosim_simulator <backend> | Simulator backend: xsim (default, Linux only, requires Vivado) or verilator (cross-platform, no Vivado required). |
-cosim_setup_only | Run simulation setup only, then stop before executing the simulation. Useful for inspecting generated simulation files before committing to a full run. |
-cosim_resume_from_post_sim | Skip re-running the simulation and jump directly to post-simulation checks. Use after a completed simulation to re-run checks without re-simulating. |
-cosim_work_dir_parallel | Create a unique subdirectory per instance when running multiple concurrent simulations, preventing work directory collisions. |
Example: save waveforms from a named work directory
./vadd --bitstream vadd.xo \
-cosim_work_dir ./cosim_work \
-xsim_save_waveform \
1000
Example: staged workflow (setup then resume)
# Step 1: set up and inspect the simulation environment
./vadd --bitstream vadd.xo -cosim_work_dir ./cosim_work -cosim_setup_only 1000
# Step 2: once the simulation in ./cosim_work has completed, run post-simulation checks without re-simulating
./vadd --bitstream vadd.xo -cosim_work_dir ./cosim_work -cosim_resume_from_post_sim 1000
For a full walkthrough of fast cosim workflows, see Fast Hardware Simulation.
C++ API
This page documents the TAPA C++ library (#include <tapa.h>). Types and functions live in the tapa namespace unless noted otherwise.
Task Invocation
tapa::task
The task hierarchy builder. An upper-level task constructs a tapa::task and chains .invoke() calls on it. The tapa::task destructor waits for all joined child instances to finish before returning.
struct task {
// Invoke func with the given arguments using the default join mode.
template <typename Func, typename... Args>
task& invoke(Func&& func, Args&&... args);
// Invoke func with an explicit mode (tapa::join or tapa::detach).
template <internal::InvokeMode mode, typename Func, typename... Args>
task& invoke(Func&& func, Args&&... args);
// Invoke func N times with the given mode.
template <internal::InvokeMode mode, int N, typename Func, typename... Args>
task& invoke(Func&& func, Args&&... args);
};
Invoke modes:
| Mode | Behavior |
|---|---|
tapa::join (default) | The task runs concurrently with siblings; the parent waits for it to finish before returning. |
tapa::detach | Fire-and-forget; the parent does not wait for the task to finish. Use with care — the parent may return before the detached task completes. |
Example:
void Top(tapa::istream<float>& in, tapa::ostream<float>& out, int n) {
tapa::task()
.invoke(LoadData, in, n)
.invoke<tapa::detach>(MonitorTask, n)
.invoke(StoreData, out, n);
}
tapa::seq
A sequential index generator. When tapa::seq{} is passed as an argument to .invoke() with a repeat count N, each invocation receives a unique integer (0, 1, 2, …, N−1). Use this to distribute indexed work across task instances, such as assigning each instance its slice of a stream array.
tapa::streams<float, 4> channels;
tapa::task().invoke<tapa::join, 4>(Worker, channels, tapa::seq{});
// Worker instance 0 gets channels[0], instance 1 gets channels[1], etc.
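The index distribution can be pictured with a small stand-alone model. This is only a sketch of the behavior described above, not TAPA's implementation: SeqModel, InvokeEach, and CollectIndices are hypothetical names, and plain std::vector objects stand in for streams.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for tapa::seq: hands out 0, 1, 2, ... on successive uses.
struct SeqModel {
  int next = 0;
  int take() { return next++; }
};

// Stand-in for a repeated invoke: calls func once per element of `channels`,
// pairing element i with the i-th value drawn from `seq`.
template <typename Func, typename T, std::size_t N>
void InvokeEach(Func func, std::array<T, N>& channels, SeqModel seq) {
  for (std::size_t i = 0; i < N; ++i) {
    const int idx = seq.take();
    assert(idx == static_cast<int>(i));  // fresh seq: index matches the slot
    func(channels[i], idx);
  }
}

// Runs four "workers" that each record the index they were given, then
// returns the recorded indices in channel order.
inline std::vector<int> CollectIndices() {
  std::array<std::vector<int>, 4> channels{};
  InvokeEach([](std::vector<int>& ch, int idx) { ch.push_back(idx); },
             channels, SeqModel{});
  std::vector<int> seen;
  for (const auto& ch : channels) seen.push_back(ch.at(0));
  return seen;
}
```

In this model, the worker attached to channel i always observes index i, which is the pairing the repeated invoke relies on.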
tapa::executable
Wraps a path to an XO or bitstream file for use in .invoke(). When an executable is passed as the second argument to .invoke(), the task runs on hardware (via FRT) instead of in software simulation.
class executable {
public:
explicit executable(std::string path);
};
Usage:
tapa::task().invoke(MyKernel, tapa::executable("my_kernel.xo"), arg1, arg2);
Streams
Streams are the fundamental inter-task communication primitive. Each stream is a fixed-depth FIFO. Blocking operations stall until data or space is available; non-blocking operations return immediately.
tapa::stream<T, Depth>
FIFO that owns the underlying storage and exposes both a read end and a write end. Declared inside an upper-level task and passed to child tasks as istream<T>& (read end) or ostream<T>& (write end). The default depth is 2.
template <typename T, uint64_t Depth = 2>
class stream;
tapa::istream<T>
Read-only view of a stream. Always passed by reference in task signatures: tapa::istream<T>&.
| Method | Blocking | Destructive | Description |
|---|---|---|---|
read() | yes | yes | Blocks until an element is available, then returns it. |
read(bool& ok) | no | yes | Non-blocking read; sets ok to true if an element was consumed. |
try_read(T& val) | no | yes | Non-blocking read; returns true and writes to val if successful. |
peek(bool& ok) | no | no | Returns the next element without consuming it; sets ok. |
try_peek(T& val) | no | no | Non-blocking peek; returns true if data was available. |
empty() | no | no | Returns true if the stream contains no elements. |
eot(bool& ok) | no | no | Returns true if the head element is an end-of-transaction marker. |
open() | yes | yes | Blocks until an EoT marker arrives, then consumes it. Used to receive stream closure. |
try_open() | no | yes | Non-blocking variant of open(); returns true if EoT was consumed. |
tapa::ostream<T>
Write-only view of a stream. Always passed by reference in task signatures: tapa::ostream<T>&.
| Method | Blocking | Destructive | Description |
|---|---|---|---|
write(const T& val) | yes | yes | Blocks until space is available, then writes val. |
try_write(const T& val) | no | yes | Non-blocking write; returns true if the element was written. |
full() | no | no | Returns true if the stream is full. |
close() | yes | yes | Writes an end-of-transaction marker; blocks until space is available. |
try_close() | no | yes | Non-blocking variant of close(); returns true if the EoT was written. |
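To make the non-blocking semantics in the two tables concrete, here is a single-threaded model of a fixed-depth FIFO. It is a sketch under the assumption that a failed try_ call leaves the queue unchanged; FifoModel is a hypothetical name, and the blocking and EoT methods are deliberately omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Single-threaded model of a fixed-depth FIFO. Method names mirror the
// non-blocking entries in the tables; this is not TAPA's implementation.
template <typename T, std::size_t Depth = 2>
class FifoModel {
 public:
  bool empty() const { return q_.empty(); }
  bool full() const { return q_.size() == Depth; }

  // Non-blocking write: refuses (returns false) when the FIFO is full.
  bool try_write(const T& val) {
    if (full()) return false;
    q_.push_back(val);
    assert(q_.size() <= Depth);  // invariant: never exceed the declared depth
    return true;
  }

  // Non-blocking, destructive read: pops the head element on success.
  bool try_read(T& val) {
    if (empty()) return false;
    val = q_.front();
    q_.pop_front();
    return true;
  }

  // Non-blocking, non-destructive peek: copies the head but leaves it queued.
  bool try_peek(T& val) const {
    if (empty()) return false;
    val = q_.front();
    return true;
  }

 private:
  static_assert(Depth > 0, "a FIFO needs at least one slot");
  std::deque<T> q_;
};
```

With the default depth of 2, a third try_write fails until a read frees a slot, which is the backpressure a producer sees on a shallow stream.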
tapa::streams<T, N, Depth>
Array of N streams of type T, each with depth Depth. Declared in an upper-level task and unpacked by index when passed to child tasks.
tapa::istreams<T, N> / tapa::ostreams<T, N>
Array of N read-only or write-only stream views. Always passed by reference in task signatures.
All stream types (istream, ostream, istreams, ostreams) must be passed by reference in task signatures. Passing by value is a compile error.
Memory (mmap)
tapa::mmap<T>
A pointer-like handle for synchronous bulk memory access. Backed by a contiguous host allocation. In a task signature, tapa::mmap<T> is passed by value.
template <typename T>
class mmap {
public:
explicit mmap(T* ptr);
mmap(T* ptr, uint64_t size);
template <typename Container>
explicit mmap(Container& container); // accepts std::vector etc.
T* data() const;
uint64_t size() const;
template <uint64_t N>
mmap<vec_t<T, N>> vectorized() const; // reinterpret as wider element type
template <typename U>
mmap<U> reinterpret() const; // reinterpret element type
};
tapa::async_mmap<T>
Decoupled memory access type. Instead of blocking on each memory operation, the kernel issues read/write requests and collects responses through five FIFO channels. This allows the kernel to pipeline memory operations. Passed by reference in task signatures: tapa::async_mmap<T>&.
See async_mmap channels below for channel details.
tapa::mmaps<T, N>
Array of N tapa::mmap<T> regions. Passed by value as a single argument and unpacked by the framework one region per child invocation.
template <typename T, uint64_t N>
class mmaps;
Directional mmap wrappers (host-side only)
Used in the top-level tapa::invoke() call to express direction hints. The kernel task signature uses plain tapa::mmap<T> or tapa::mmaps<T, N>.
| Wrapper | Direction |
|---|---|
tapa::read_only_mmap<T> | Host writes, kernel reads |
tapa::write_only_mmap<T> | Kernel writes, host reads |
tapa::read_write_mmap<T> | Both read and write |
tapa::placeholder_mmap<T> | No direction hint |
tapa::read_only_mmaps<T, N> | Array variant of read_only_mmap |
tapa::write_only_mmaps<T, N> | Array variant of write_only_mmap |
tapa::read_write_mmaps<T, N> | Array variant of read_write_mmap |
tapa::aligned_allocator<T>
STL-compatible allocator that returns page-aligned memory suitable for DMA transfers. Use this with std::vector when allocating host buffers that will be passed to a kernel.
std::vector<float, tapa::aligned_allocator<float>> buf(n);
tapa::invoke(MyKernel, bitstream, tapa::read_only_mmap<float>(buf), n);
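For readers curious what a page-aligned, STL-compatible allocator involves, the following is a minimal stand-alone sketch. It assumes a 4096-byte page and a POSIX system (it uses posix_memalign); PageAlignedAllocator and IsPageAligned are hypothetical names, and TAPA's own allocator may differ in both alignment and mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <vector>

// Assumed page size for this sketch.
constexpr std::size_t kPageSize = 4096;

// Minimal STL-compatible allocator returning kPageSize-aligned storage.
template <typename T>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

  T* allocate(std::size_t n) {
    assert(n <= static_cast<std::size_t>(-1) / sizeof(T));  // no overflow
    void* p = nullptr;
    if (posix_memalign(&p, kPageSize, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(p);
  }

  void deallocate(T* p, std::size_t) noexcept { free(p); }
};

// All instances are interchangeable because the allocator is stateless.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return false;
}

// True when the address is a multiple of the assumed page size.
inline bool IsPageAligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}
```

A std::vector<float, PageAlignedAllocator<float>> then behaves like an ordinary vector whose data() pointer satisfies IsPageAligned.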
async_mmap Channels
tapa::async_mmap<T> exposes five public member channels. The kernel writes addresses and data to the request channels and reads results from the response channels. Channel methods prefixed with try_ are non-blocking.
| Channel | Type | Direction | Description |
|---|---|---|---|
read_addr | ostream<int64_t> | kernel → memory | Write an element index to request a read. The framework converts the index to a byte offset internally. |
read_data | istream<T> | memory → kernel | Read the data returned by a previously issued read request. |
write_addr | ostream<int64_t> | kernel → memory | Write an element index to request a write. |
write_data | ostream<T> | kernel → memory | Write the data to be written at the requested address. |
write_resp | istream<uint8_t> | memory → kernel | Drain write-completion acknowledgements. Each response value encodes burst_length - 1 (i.e., a value of 0 means one write completed, 255 means 256 writes completed). |
The kernel must drain write_resp to avoid deadlock. If the response channel fills up, the memory subsystem stops issuing further write completions and the kernel stalls.
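The write_resp encoding (value = completed writes minus one) is easy to get off by one, so a short worked sketch may help. WritesAcked and TotalAcked are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of writes acknowledged by one write_resp value (value = count - 1).
inline std::uint64_t WritesAcked(std::uint8_t resp) {
  return static_cast<std::uint64_t>(resp) + 1;
}

// Sums a sequence of responses and checks that the running total never
// exceeds the number of writes that were issued.
inline std::uint64_t TotalAcked(const std::vector<std::uint8_t>& resps,
                                std::uint64_t issued) {
  std::uint64_t done = 0;
  for (std::uint8_t r : resps) {
    done += WritesAcked(r);
    assert(done <= issued && "more acknowledgements than writes issued");
  }
  return done;
}
```

For example, the responses 0, 255, and 3 together acknowledge 1 + 256 + 4 = 261 writes, so a kernel that issued 261 writes can stop draining after those three values.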
Typical async_mmap read pattern:
void Reader(tapa::async_mmap<float>& mem, tapa::ostream<float>& out, int n) {
for (int i_req = 0, i_resp = 0; i_resp < n;) {
// The pragma sits inside the loop body so that it pipelines this loop.
#pragma HLS pipeline II=1
if (i_req < n && !mem.read_addr.full()) {
mem.read_addr.write(i_req);
++i_req;
}
float val;
if (mem.read_data.try_read(val)) {
out.write(val);
++i_resp;
}
}
}
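The benefit of keeping separate request and response counters can be seen in a plain C++ model that runs off-FPGA. This sketch assumes a made-up memory that answers one pending request per loop iteration; SimulateReads and its parameters are hypothetical, and two std::deque objects stand in for read_addr and read_data.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Runs the two-counter loop against a modelled memory whose request queue
// holds at most `addr_depth` addresses and which answers one request per
// iteration. Appends the data to `out` and returns the iteration count.
inline std::size_t SimulateReads(const std::vector<float>& mem,
                                 std::size_t addr_depth,
                                 std::vector<float>& out) {
  assert(addr_depth > 0);
  std::deque<std::size_t> read_addr;  // models mem.read_addr
  std::deque<float> read_data;        // models mem.read_data
  const std::size_t n = mem.size();
  std::size_t iterations = 0;
  for (std::size_t i_req = 0, i_resp = 0; i_resp < n; ++iterations) {
    // Issue side: send the next address if the request queue has room.
    if (i_req < n && read_addr.size() < addr_depth) {
      read_addr.push_back(i_req);
      ++i_req;
    }
    // Response side: collect one element if the memory has produced any.
    if (!read_data.empty()) {
      out.push_back(read_data.front());
      read_data.pop_front();
      ++i_resp;
    }
    // Modelled memory: serve one pending request per iteration.
    if (!read_addr.empty()) {
      read_data.push_back(mem[read_addr.front()]);
      read_addr.pop_front();
    }
  }
  return iterations;
}
```

Because a new request is issued while earlier responses are still in flight, reading n elements takes n + 1 iterations under the one-iteration latency assumed here, rather than a full round trip per element.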
Utilities
tapa::vec_t<T, N>
An N-element SIMD vector of type T. Stores elements as a packed bit array, which maps directly to wide AXI ports. Supports element access via operator[], element-wise arithmetic operators, and common reductions (sum, product).
template <typename T, int N>
struct vec_t {
static constexpr int length = N;
static constexpr int width = widthof<T>() * N; // total bit width
T& operator[](int pos);
const T& operator[](int pos) const;
};
Related free functions: truncated<begin, end>(vec), cat(v1, v2), make_vec<N>(val).
tapa::widthof<T>()
Returns the bit width of type T. For ap_int<W> and ap_uint<W>, returns W. For plain C++ types, returns sizeof(T) * CHAR_BIT.
template <typename T>
inline constexpr int widthof();
template <typename T>
inline constexpr int widthof(T object); // deduce T from argument
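The width rule for plain C++ types, and the way it scales for an N-element vector, can be checked with a stand-alone sketch. WidthOf, VecWidth, and ElementsPerWord are hypothetical names for this illustration; the ap_int and ap_uint cases are not modelled, and the 512-bit bus used in the usage note is an assumed example width.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Bit width of a plain C++ type, following the sizeof(T) * CHAR_BIT rule.
template <typename T>
constexpr int WidthOf() {
  return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Total bit width of an N-element vector of T (element width times N).
template <typename T, int N>
constexpr int VecWidth() {
  static_assert(N > 0, "vector length must be positive");
  return WidthOf<T>() * N;
}

// How many elements of `elem_bits` fit exactly in a bus word of `bus_bits`.
inline int ElementsPerWord(int bus_bits, int elem_bits) {
  assert(elem_bits > 0 && bus_bits % elem_bits == 0);
  return bus_bits / elem_bits;
}
```

On a platform with a 32-bit float, VecWidth<float, 16>() is 512, and ElementsPerWord(512, WidthOf<float>()) is 16, which is why a 16-element float vector fills one 512-bit word.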
EoT macros
End-of-transaction macros simplify consuming a stream until a sentinel marker is received.
| Macro | Description |
|---|---|
TAPA_WHILE_NOT_EOT(stream) | Loop body executes once per data element; loop exits when the EoT marker is seen. |
TAPA_WHILE_NEITHER_EOT(s1, s2) | Two-stream variant; exits when either stream reaches EoT. |
TAPA_WHILE_NONE_EOT(s1, s2, s3) | Three-stream variant. |
// Example: consume all elements from 'in' and forward to 'out'
TAPA_WHILE_NOT_EOT(in) {
out.write(in.read());
}
in.open(); // consume the EoT marker
out.close(); // send EoT marker downstream
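The control flow of that loop can be modelled without TAPA by tagging each queue slot as data or end-of-transaction. This is a simplified sketch: Slot and ForwardUntilEot are hypothetical names, std::deque stands in for the streams, and unlike a real stream the model does not block, so it requires the EoT marker to be present in the input already.

```cpp
#include <cassert>
#include <deque>

// One stream slot: either a data element or an end-of-transaction marker.
struct Slot {
  float value;
  bool eot;
};

// Forwards data from `in` to `out` until the EoT marker reaches the head,
// then consumes the marker and sends one downstream.
inline void ForwardUntilEot(std::deque<Slot>& in, std::deque<Slot>& out) {
  while (!in.empty() && !in.front().eot) {
    out.push_back(in.front());
    in.pop_front();
  }
  assert(!in.empty() && "input ended without an EoT marker");
  in.pop_front();                   // models in.open()
  out.push_back(Slot{0.0f, true});  // models out.close()
}
```

Elements queued after the marker stay in the input untouched, so they belong to the next transaction.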
Synthesis pragmas (C++ attributes)
These C++ attributes are recognised by TAPA and lowered to Vitis HLS pragmas during synthesis. They have no effect in software simulation.
| Attribute | Description |
|---|---|
[[tapa::pipeline(II)]] | Pipeline the enclosing loop or function with initiation interval II. |
[[tapa::unroll(factor)]] | Unroll the enclosing loop by factor. |
[[tapa::target("ignore")]] | Mark a task for custom RTL replacement. TAPA generates a port-signature template but does not synthesize the task body. |
[[tapa::target("ignore")]] was formerly written as [[tapa::target("non_synthesizable", "xilinx")]]. The "ignore" form is the current spelling.
tapa::hls sub-namespace
tapa::hls::stream<T> is a stream type that behaves like hls::stream<T> in software simulation: it has effectively infinite depth, so producers never block in simulation. Use it when incrementally migrating a Vitis HLS design and you want software simulation to pass without tuning stream depths. #include <tapa.h> includes this automatically.
tapa::hls::stream synthesizes to the same RTL FIFO as tapa::stream<T, N> with the declared depth N. The infinite depth applies only to software simulation. The practical reason to replace it before a hardware build is that software simulation with tapa::hls::stream will not expose backpressure bugs. Switching to tapa::stream<T, Depth> with a tuned depth (passed to child tasks as tapa::istream<T>& / tapa::ostream<T>&) catches those bugs at simulation time rather than on hardware.
Output Files
Output Artifacts
The artifact produced by tapa depends on the target selected with --target.
Xilinx Vitis target (--target xilinx-vitis, the default)
Produces an .xo object file. This is passed to the Vitis v++ compiler for bitstream generation. An XO file is a ZIP archive; you can unzip it to inspect or manually edit the RTL it contains, then re-zip it before passing it to v++.
Xilinx HLS target (--target xilinx-hls)
Produces a .zip RTL archive instead of an .xo file. The archive contains the same RTL files and metadata but without the Vitis shell wrapper. Use this when the RTL is consumed directly by a downstream EDA tool.
Reproducibility
TAPA strips timestamps, absolute paths, and random IDs from both .xo and .zip artifacts before writing them to disk. Given the same source code and tool versions, repeated invocations produce byte-identical output. This makes the artifacts suitable for CI and release attestation workflows.
Byte identity holds only within the same vendor tool version. Upgrading Vitis HLS or Vivado will typically change internal artifact content even for identical source inputs.
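A CI job that relies on this property needs a byte-for-byte comparison of two build outputs. The sketch below shows one way to do that with the standard library only; SameBytes is a hypothetical helper, and it reads both files fully into memory, which is an assumption that suits small artifacts better than very large ones.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Returns true when both files can be opened and contain exactly the same
// bytes. A missing or unreadable file counts as "not identical".
inline bool SameBytes(const std::string& path_a, const std::string& path_b) {
  assert(!path_a.empty() && !path_b.empty());
  std::ifstream a(path_a, std::ios::binary);
  std::ifstream b(path_b, std::ios::binary);
  if (!a || !b) return false;
  const std::string bytes_a((std::istreambuf_iterator<char>(a)),
                            std::istreambuf_iterator<char>());
  const std::string bytes_b((std::istreambuf_iterator<char>(b)),
                            std::istreambuf_iterator<char>());
  return bytes_a == bytes_b;
}
```

Comparing the outputs of two clean builds made with the same tool versions should return true; a false result points at either a source change or a vendor tool upgrade.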
Intermediate Files
TAPA writes intermediate files to the working directory, which is ./work.out/ by default. Passing --work-dir explicitly is recommended. The structure is:
work.out/
├── cpp/
├── flatten/
├── log/
├── tar/
├── hdl/
├── graph.json
├── settings.json
├── report.json
└── report.yaml
File and directory descriptions
cpp/
Contains per-task C++ source files extracted by tapa analyze. Each file is independently compiled to RTL by vitis_hls.
flatten/
Created during tapa analyze. Contains preprocessed (flattened) copies of the input source files, one per input file, with a short hash prefix in the filename to avoid collisions. All #include directives are expanded and comments are preserved, giving tapacc self-contained translation units to operate on.
log/
Stores logs from processing steps, including vitis_hls csynth_design logs.
tar/
Contains one .tar archive per task. Each archive holds the output of csynth_design for that task.
hdl/
Stores RTL files for all tasks generated by vitis_hls, plus TAPA-specific infrastructure RTL.
graph.json
JSON file recording all contents and metadata of the input design, including the task graph structure.
settings.json
Records compilation settings shared across pipeline steps (target, part number, clock period, platform). Downstream tapa sub-commands read this file to avoid repeating options on the command line.
report.json / report.yaml
Post-synthesis resource utilization report, written unconditionally after tapa synth completes. Both files contain the same data, in JSON and YAML encoding respectively. Passing --enable-synth-util to tapa synth additionally generates per-task .hier.util.rpt files under tar/, but does not affect whether these top-level report files are written.
C++ Quick Reference
Common patterns for writing TAPA kernels. For full API details see C++ API.
Task structure
#include <cstdint>
#include <tapa.h>
// Leaf tasks: contain all computation.
void Load(tapa::mmap<const float> mem, uint64_t n, tapa::ostream<float>& q) {
  for (uint64_t i = 0; i < n; ++i) q.write(mem[i]);
}
void Store(tapa::istream<float>& q, tapa::mmap<float> mem, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) mem[i] = q.read();
}
// Upper-level task: declare streams, invoke leaf tasks. No computation.
// Defined after the leaf tasks so that Load and Store are already declared.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n) {
  tapa::stream<float, 16> q("q");
  tapa::task()
      .invoke(Load, in, n, q)
      .invoke(Store, q, out, n);
}
Host code
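#include <cstdint>
#include <cstdlib>
#include <vector>
#include <gflags/gflags.h>
#include <tapa.h>
// Top-level task, defined in the kernel source file.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n);
DEFINE_string(bitstream, "", "XO or xclbin path. Empty = software simulation.");
int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  // Element count from the first positional argument; 1024 is an arbitrary default.
  const uint64_t n = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 1024;
  std::vector<float, tapa::aligned_allocator<float>> a(n), b(n);
  tapa::invoke(Top, FLAGS_bitstream,
               tapa::read_only_mmap<const float>(a),
               tapa::write_only_mmap<float>(b),
               n);
  return 0;
}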
| FLAGS_bitstream value | Backend |
|---|---|
| (empty) | Software simulation |
| kernel.xo | Fast cosimulation |
| kernel.hw.xclbin | On-board execution |
Stream types
| Type | Use in signature | Direction |
|---|---|---|
tapa::stream<T, Depth> | local variable in upper task | owner |
tapa::istream<T>& | leaf task parameter | read only |
tapa::ostream<T>& | leaf task parameter | write only |
tapa::streams<T, N> | local variable | array owner |
tapa::istreams<T, N>& | leaf task parameter | array read |
tapa::ostreams<T, N>& | leaf task parameter | array write |
// Read
T val = in.read(); // blocking
bool ok = in.try_read(val); // non-blocking, returns true on success
// Write
out.write(val); // blocking
bool ok = out.try_write(val); // non-blocking
// State checks
bool e = in.empty();
bool f = out.full();
// End-of-transaction
out.close(); // send EoT marker
in.open(); // consume EoT marker
TAPA_WHILE_NOT_EOT(in) { ... } // loop until EoT
Stream depth and FPGA resource:
| Depth | Resource |
|---|---|
| < 128 | SRL shift-register (no BRAM) |
| ≥ 128 | BRAM |
| ≥ 4096 and element width ≥ 36 b | URAM |
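The thresholds in this table can be written as a small lookup, which is handy when estimating resource use for a list of streams. This is only a restatement of the table as code; FifoResource is a hypothetical helper, and the actual mapping is decided by the synthesis tools.

```cpp
#include <cassert>
#include <string>

// Maps a stream's depth and element width (in bits) to the resource class
// listed in the table. The URAM rule is checked first because it is the
// most specific.
inline std::string FifoResource(unsigned depth, unsigned width_bits) {
  assert(depth > 0 && width_bits > 0);
  if (depth >= 4096 && width_bits >= 36) return "URAM";
  if (depth >= 128) return "BRAM";
  return "SRL";
}
```

For instance, a depth-16 float stream falls in the SRL class, while a depth-4096 stream of 32-bit elements stays in BRAM because its element width is below 36 bits.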
Memory types
| Type | Signature | Access style |
|---|---|---|
tapa::mmap<T> | by value | synchronous, pointer-like |
tapa::async_mmap<T> | by reference & | decoupled AXI channels |
// mmap — simple loop
for (int i = 0; i < n; ++i) out[i] = in[i];
// async_mmap — overlapping reads (two-counter loop)
for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
if (i_req < n && mem.read_addr.try_write(i_req)) ++i_req; // advance only if the request was accepted
T val;
if (mem.read_data.try_read(val)) result[i_resp++] = val;
}
// async_mmap — writes with response drain
for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
if (i_req < n && !src.empty() &&
!mem.write_addr.full() && !mem.write_data.full()) {
mem.write_addr.try_write(i_req);
mem.write_data.try_write(src.read(nullptr));
++i_req;
}
uint8_t ack;
if (mem.write_resp.try_read(ack)) i_resp += unsigned(ack) + 1;
}
Parallel task instances
// Each instance receives one stream from the array and a unique index.
void Worker(tapa::istream<float>& in, int idx) { /* ... */ }
// Inside an upper-level task: invoke 4 instances; tapa::seq supplies idx = 0..3
tapa::streams<float, 4> ch("ch");
tapa::task().invoke<tapa::join, 4>(Worker, ch, tapa::seq{});
Useful pragmas
#pragma HLS pipeline II=1 // pipeline loop with II=1
#pragma HLS unroll factor=4 // partially unroll loop
// C++ attribute equivalents
[[tapa::pipeline(1)]]
[[tapa::unroll(4)]]
[[tapa::target("ignore")]] // mark task for custom RTL replacement
End-of-transaction macros
TAPA_WHILE_NOT_EOT(in) { out.write(in.read(nullptr)); }
TAPA_WHILE_NEITHER_EOT(in1,in2) { /* both have data */ }
TAPA_WHILE_NONE_EOT(a, b, c) { /* all three have data */ }
Build and run
# Software simulation
tapa g++ -- kernel.cpp host.cpp -o app
./app
# RTL synthesis
tapa compile --top Top --part-num xcu250-figd2104-2L-e \
--clock-period 3.33 -f kernel.cpp -o kernel.xo
# Fast cosimulation
./app --bitstream=kernel.xo
# Bitstream link (v++)
v++ -o app.hw.xclbin --link --target hw --kernel Top \
--platform xilinx_u250_gen3x16_xdma_4_1_202210_1 kernel.xo
# On-board run
./app --bitstream=app.hw.xclbin
Publications
Papers describing the TAPA compiler, the physical design toolflow it integrates, and accelerators built with TAPA.
Core Publications
TAPA Compiler
Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. IEEE FCCM, 2021. [PDF] [Code]
Introduces the TAPA task API, coroutine-based software simulation (3.2× faster than Vitis HLS sequential simulation), and fast hierarchical RTL generation (6.8× faster QoR iteration). Reduces kernel and host code by 22% and 51% on average versus Vitis HLS dataflow.
Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong. TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design. ACM TRETS, 2023. [PDF] [Code]
Full journal treatment of the TAPA compiler and runtime. Average frequency improves from 147 MHz to 297 MHz (102%) across 43 designs; 16 previously unroutable designs achieve 274 MHz on average after co-optimization with physical design.
Floorplanning and Physical Design
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, Jason Cong. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. ACM/SIGDA FPGA, 2021. (Best Paper Award) [PDF] [Code]
Doubles achievable clock frequency on average by automatically floorplanning HLS dataflow designs across SLR boundaries and inserting pipeline registers. Now maintained exclusively as a plug-in of the TAPA workflow.
Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Jie Wang, Yuze Chi, Weikang Qiao, Alireza Kaviani, Zhiru Zhang, Jason Cong. RapidStream: Parallel Physical Implementation of FPGA HLS Designs. ACM/SIGDA FPGA, 2022. (Best Paper Award) [PDF]
Split compilation with parallel placement and routing per partition. Achieves 5–7× compile time reduction and up to 1.3× frequency increase on Xilinx U250.
Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Eddie Hung, Wuxi Li, Jason Lau, Weikang Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, Alireza Kaviani, Zhiru Zhang, Jason Cong. RapidStream 2.0: Automated Parallel Implementation of Latency-Insensitive FPGA Designs through Partial Reconfiguration. ACM TRETS, 2023. [Link]
Extends RapidStream with virtual pins and partial reconfiguration. Achieves 5–7× compile time reduction and 1.3× frequency increase on Xilinx U280, approximately 2× faster than RapidStream 1.0.
Jason Lau, Yuanlong Xiao, Yutong Xie, Yuze Chi, Linghao Song, Shaojie Xiang, Michael Lo, Zhiru Zhang, Jason Cong, Licheng Guo. RapidStream IR: Infrastructure for FPGA High-Level Physical Synthesis. IEEE/ACM ICCAD, 2024. [PDF]
Generalizes RapidStream into a reusable IR for FPGA high-level physical synthesis. Supports multiple task-parallel HLS frontends including TAPA and PASTA.
Compiler Extensions
Young-kyu Choi, Yuze Chi, Jason Lau, Jason Cong. TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis. IEEE TCAD, 2022. [Link]
Eliminates unnecessary control logic for streaming applications. Achieves 16% LUT and 45% FF reduction on systolic-array designs on Alveo U250. Integrated into the TAPA compilation flow.
Neha Prakriya, Yuze Chi, Suhail Basalama, Linghao Song, Jason Cong. TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs. ACM ASPLOS, 2024. [arXiv] [Code]
Extends TAPA to automatically partition designs across a cluster of FPGAs with the --multi-fpga N compiler flag. Handles congestion control, resource balancing, and inter-FPGA pipelining.
Moazin Khatti, Xingyu Tian, Yuze Chi, Licheng Guo, Jason Cong, Zhenman Fang. PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. IEEE FCCM, 2023; extended in ACM TRETS, 2024. [Link]
Adds automated latency-insensitive buffer (ping-pong) channel synthesis alongside FIFO streams in the task-parallel HLS flow, targeting the same class of multi-die FPGA designs as TAPA.
Suhail Basalama, Jason Cong. Stream-HLS: Towards Automatic Dataflow Acceleration. ACM/SIGDA FPGA, 2025. [Paper] [Code]
MLIR-based compiler that takes PyTorch or C/C++ and automatically generates optimized TAPA dataflow accelerators. Outperforms prior automation frameworks by up to 79× and manually-optimized TAPA designs by up to 11× geometric mean.
Akhil Raj Baranwal, Zhenman Fang. PoCo: Extending Task-Parallel HLS Programming with Shared Multi-Producer Multi-Consumer Buffer Support. ACM TRETS, 2025. [PDF]
Generalizes TAPA and PASTA's point-to-point SPSC channels to shared multi-producer–multi-consumer buffer abstractions with placement-aware optimizations for multi-die FPGAs.
Application Papers
Accelerators built with the TAPA compiler and toolflow.
Sparse Linear Algebra
Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, Jason Cong. Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication. ACM/SIGDA FPGA, 2022. [PDF] [Code]
SpMM accelerator on Alveo U280/U250. TAPA/AutoBridge-compiled DDR variant achieves 260 MHz versus a Vivado baseline of 189 MHz. Up to 2.50× geomean speedup over NVIDIA K80.
Linghao Song, Yuze Chi, Licheng Guo, Jason Cong. Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication. ACM/IEEE DAC, 2022. [Code]
SpMV accelerator on Alveo U280 using 24 HBM channels. The Vitis HLS baseline failed to route; TAPA + AutoBridge achieves 270 MHz and up to 60.55 GFLOP/s.
Linghao Song, Licheng Guo, Suhail Basalama, Yuze Chi, Robert F. Lucas, Jason Cong. Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver. ACM/SIGDA FPGA, 2023. [Code]
Conjugate gradient solver on U280 HBM. 3.94× speedup and 2.94× better energy efficiency over Xilinx XcgSolver; 3.34× better energy efficiency and 77% throughput of an A100 GPU at 4× lower memory bandwidth. Built with TAPA and AutoBridge.
Zifan He, Linghao Song, Robert F. Lucas, Jason Cong. LevelST: Stream-based Accelerator for Sparse Triangular Solver. ACM/SIGDA FPGA, 2024. [Paper] [Code]
First HBM-FPGA accelerator for SpTRSV. 2.65× speedup and 9.82× higher energy efficiency versus V100/RTX 3060 with cuSPARSE. Built on TAPA with AutoBridge floorplanning.
Manoj B. Rajashekar, Xingyu Tian, Zhenman Fang. HiSpMV / MAD-HiSpMV: Hybrid Row Distribution and Vector Buffering for Imbalanced SpMV Acceleration on FPGAs. ACM/SIGDA FPGA, 2024; extended in ACM TRETS, 2025. [Paper] [Code]
SpMV accelerator on Alveo U280 adapting row distribution to matrix structure. Uses TAPA for hardware build, cosimulation, and hardware emulation.
Ahmad Sedigh Baroughi, Xingyu Tian, Moazin Khatti, Akhil Raj Baranwal, Yuze Chi, Licheng Guo, Jason Cong, Zhenman Fang. HiSpMM: High Performance High Bandwidth Sparse-Dense Matrix Multiplication on HBM-equipped FPGAs. ACM TRETS, 2025. [Paper] [Code]
SpMM accelerator on Alveo U280 using TAPA for hardware generation, cosimulation, and runtime.
Graph Analytics
Yuze Chi, Licheng Guo, Jason Cong. Accelerating SSSP for Power-Law Graphs (SPLAG). ACM/SIGDA FPGA, 2022. [Paper] [Code]
FPGA SSSP accelerator on Alveo U280. Up to 4.9× speedup over prior FPGA accelerators and 2.6× over a 32-thread CPU; reaches 0.9× the performance of an A100 GPU that has 4.1× the power budget. Fully parameterized TAPA HLS C++ implementation.
Systolic Arrays and Machine Learning
Jie Wang, Licheng Guo, Jason Cong. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. ACM/SIGDA FPGA, 2021. [Paper] [Code]
Polyhedral systolic array compiler targeting MM, CNN, LU, MTTKRP. Integrated with TAPA and AutoBridge for routing congestion resolution and frequency improvement.
Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, Jason Cong. FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA. ACM TRETS, 2023. [Paper] [Code]
CNN compilation framework for OpenPose, U-Net, E-Net, and VGG-16 on Alveo U250/U280. TAPA code generation added as a journal contribution. 2.3× performance improvement; 5× further speedup via software-hardware pipelining.
K-Nearest Neighbors
Alec Lu, Zhenman Fang, Nazanin Farahpour, Lesley Shannon. CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs. IEEE ICFPT, 2020. [Code]
KNN accelerator on Alveo U280. TAPA-compiled design achieves 252 MHz versus a Vivado baseline of 165 MHz.
Kenneth Liu, Alec Lu, Kartik Samtani, Zhenman Fang, Licheng Guo. CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs. ACM TRETS, 2023. [Paper] [Code]
Streaming-based redesign on Alveo U280 with automated TAPA HLS C code generation. Up to 45× speedup over a 48-thread CPU.
Multi-FPGA Applications
Tianqi Zhang, Neha Prakriya, Sumukh Pinge, Jason Cong, Tajana Rosing. SpectraFlux: Harnessing the Flow of Multi-FPGA in Mass Spectrometry Clustering. ACM/IEEE DAC, 2024. [Paper]
Uses TAPA-CS to partition a mass spectrometry clustering workload across multiple networked HBM-FPGAs.
Glossary
analyze
The tapa analyze step. Parses the C++ source with tapacc (a Clang-based tool) and extracts the task graph and inter-task channels to graph.json in the work directory. This step does not invoke any vendor tools and runs on any host.
async_mmap
A decoupled memory access type (tapa::async_mmap<T>). Instead of stalling on each memory operation, the kernel issues requests through address FIFOs and collects results through data and response FIFOs independently. This decoupling allows the kernel to keep the memory bus busy even when computation is not complete, enabling higher effective memory bandwidth. async_mmap must be passed by reference in task signatures.
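The benefit of decoupling requests from responses can be sketched with a toy cycle count in plain C++. This is not the TAPA API: the fixed `latency`, the one-request-per-cycle issue rate, and the function names `blocking_read_cycles` and `decoupled_read_cycles` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy model, not the TAPA API: memory answers every read after a fixed
// `latency` (in cycles) and accepts at most one new request per cycle.

// Blocking style: issue one read, stall until its data arrives, repeat.
inline int blocking_read_cycles(const std::vector<int>& mem, int latency,
                                long& sum) {
  assert(latency >= 0);
  int cycle = 0;
  for (int value : mem) {
    int ready = cycle + latency;  // request issued at `cycle`
    cycle = ready + 1;            // stall until the data is back, then use it
    sum += value;
  }
  return cycle;
}

// Decoupled style: keep issuing addresses while earlier reads are still in
// flight, and collect each response in the cycle it becomes ready.
inline int decoupled_read_cycles(const std::vector<int>& mem, int latency,
                                 long& sum) {
  assert(latency >= 0);
  struct Request {
    std::size_t addr;
    int ready_cycle;
  };
  std::queue<Request> in_flight;  // stands in for the address/data FIFOs
  std::size_t issued = 0, received = 0;
  int cycle = 0;
  while (received < mem.size()) {
    if (issued < mem.size()) {
      in_flight.push({issued, cycle + latency});
      ++issued;
    }
    if (!in_flight.empty() && in_flight.front().ready_cycle <= cycle) {
      sum += mem[in_flight.front().addr];
      in_flight.pop();
      ++received;
    }
    ++cycle;
  }
  return cycle;
}
```

With four reads and a latency of 3, the blocking loop takes 16 cycles in this model and the decoupled loop takes 7. Real memory systems add bursts and limits on outstanding requests that this sketch ignores.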
backpressure
The condition where a producer cannot write to a stream because the downstream consumer has not yet drained elements from the FIFO and the buffer is full. The producer blocks until the consumer reads at least one element. Backpressure propagates naturally through TAPA streams and is the primary flow-control mechanism.
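Backpressure can be illustrated with a small bounded FIFO written in plain C++. The `BoundedFifo` class and the `writes_before_stall` helper are hypothetical names for this sketch and do not reflect how TAPA implements streams; a refused write here stands in for the cycle in which a real producer would block.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Plain C++ model of a bounded FIFO; not the TAPA API.
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t depth) : depth_(depth) {
    assert(depth > 0);
  }

  // Returns false when the FIFO is full: the producer would stall here.
  bool try_write(const T& value) {
    if (buffer_.size() == depth_) return false;
    buffer_.push_back(value);
    return true;
  }

  // Returns false when the FIFO is empty: the consumer would stall here.
  bool try_read(T& value) {
    if (buffer_.empty()) return false;
    value = buffer_.front();
    buffer_.pop_front();
    return true;
  }

  bool full() const { return buffer_.size() == depth_; }

 private:
  std::size_t depth_;
  std::deque<T> buffer_;
};

// Writes 0, 1, 2, ... until the first refusal and returns how many writes
// were accepted before backpressure kicked in.
inline int writes_before_stall(BoundedFifo<int>& fifo, int attempts) {
  int accepted = 0;
  for (int i = 0; i < attempts; ++i) {
    if (!fifo.try_write(i)) break;
    ++accepted;
  }
  return accepted;
}
```

On an empty FIFO of depth 2, `writes_before_stall(fifo, 5)` accepts two elements and then stops; a single `try_read` frees exactly one slot for the producer.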
cosim (see also: fast cosim)
Hardware cosimulation. Runs RTL simulation using the XO artifact to verify the hardware implementation against the software model. TAPA supports fast cosim, which uses the XO directly without running full Vivado implementation.
detached task
A task invoked with .invoke<tapa::detach>(). A detached task runs concurrently with its siblings but the parent does not wait for it to finish before returning. Useful for background tasks such as monitors or credit managers. See tapa::task in the API reference.
EoT (end-of-transaction)
A sentinel value written to a stream to signal the end of a data sequence. The producer calls ostream::close() to write the EoT marker; the consumer calls istream::open() to consume it. The TAPA_WHILE_NOT_EOT macro automates looping until EoT is detected.
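The close/open handshake can be sketched with a plain C++ model in which each FIFO slot carries either a payload or the EoT marker. `ModelStream`, `Slot`, and `drain_one_transaction` are hypothetical names; only `close()` and `open()` mirror the calls named in this entry, and the model says nothing about how TAPA encodes the marker internally.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// One FIFO slot: either a payload or the end-of-transaction marker.
struct Slot {
  int value;
  bool eot;
};

// Plain C++ model of a stream with EoT support; not the TAPA API.
class ModelStream {
 public:
  void write(int v) { q_.push({v, false}); }
  void close() { q_.push({0, true}); }  // producer side: append the marker

  // Peeks at the head slot and reports whether it is the EoT marker.
  bool eot() const {
    assert(!q_.empty());
    return q_.front().eot;
  }

  // Removes and returns the head payload; the head must not be the marker.
  int read() {
    assert(!q_.empty() && !q_.front().eot);
    int v = q_.front().value;
    q_.pop();
    return v;
  }

  // Consumer side: consume the marker so the next transaction can start.
  void open() {
    assert(!q_.empty() && q_.front().eot);
    q_.pop();
  }

  bool empty() const { return q_.empty(); }

 private:
  std::queue<Slot> q_;
};

// Reads payloads until the marker is at the head, then consumes the marker.
inline std::vector<int> drain_one_transaction(ModelStream& s) {
  std::vector<int> out;
  while (!s.eot()) out.push_back(s.read());
  s.open();
  return out;
}
```

The loop in `drain_one_transaction` plays the role that the entry attributes to `TAPA_WHILE_NOT_EOT`: two back-to-back transactions written with `close()` between them are drained as two separate sequences.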
fast cosim
Synonym for cosim in the TAPA context. Fast cosim is invoked by passing a .xo file as the --bitstream argument to the host executable. The host executable runs the Rust libfrt cosim runtime in-process, which avoids a full Vivado implementation run and is significantly faster than traditional cosim flows.
leaf task
A task that contains only computation and does not call .invoke(). Leaf tasks are the units of synthesis: each leaf task is compiled to RTL by Vitis HLS independently. A leaf task may use streams, mmap, or async_mmap parameters.
mmap
Memory-mapped region. A contiguous block of host memory exposed to the kernel as a pointer-like handle (tapa::mmap<T>). The kernel accesses it synchronously, similar to a C pointer. For pipelined non-blocking access, use async_mmap instead. mmap is passed by value in task signatures.
mmaps
An array of N mmap regions (tapa::mmaps<T, N>) passed as a single argument. The framework distributes one region per child task invocation when the parent iterates over N instances.
pack
The tapa pack step. Packages per-task RTL produced by tapa synth into a single XO (or ZIP) artifact suitable for passing to v++ or for use in fast cosim.
remote execution
Offloading vendor-tool steps (HLS, pack) to a remote Linux host over SSH. Configured with --remote-host. The local machine runs tapacc (the analyze step) and transfers source files; the remote host runs Vitis HLS. Useful when cross-compiling from macOS or when the local machine lacks a Vitis licence.
stream
A FIFO channel between tasks (tapa::stream<T, Depth>). Streams are the fundamental communication primitive in TAPA. A stream is declared in an upper-level task and passed to child tasks as istream<T>& (read end) or ostream<T>& (write end). The FIFO enforces backpressure automatically.
stream depth
The number of elements the FIFO can hold before the producer blocks. Declared as the second template parameter of tapa::stream<T, Depth>. The default depth is 2. Increasing depth decouples producer and consumer and can improve throughput at the cost of FPGA BRAM or LUT resources.
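How depth decouples the two ends can be seen in a toy cycle-level model in plain C++. The scenario is an assumption chosen for illustration: the producer tries to write one element per cycle, while the consumer is idle for the first `latency` cycles and then reads one element per cycle. The function name `producer_stall_cycles` is hypothetical and the model is not derived from TAPA's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Counts the cycles in which the producer finds the FIFO full.
// Each cycle the consumer acts first, then the producer.
inline int producer_stall_cycles(std::size_t depth, int total, int latency) {
  assert(depth > 0 && total >= 0 && latency >= 0);
  std::size_t occupancy = 0;  // elements currently held in the FIFO
  int written = 0;
  int stalls = 0;
  for (int cycle = 0; written < total; ++cycle) {
    // Consumer: idle during start-up, then drains one element per cycle.
    if (cycle >= latency && occupancy > 0) --occupancy;
    // Producer: writes if there is room, otherwise stalls this cycle.
    if (occupancy < depth) {
      ++occupancy;
      ++written;
    } else {
      ++stalls;
    }
  }
  return stalls;
}
```

For 8 elements and a consumer start-up latency of 4 cycles, this model gives 3 stall cycles at depth 1, 2 at depth 2, and none at depth 4: once the FIFO is deep enough to absorb the start-up gap, the producer never blocks.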
synth
The tapa synth step. Runs Vitis HLS on each leaf task extracted during tapa analyze to produce per-task Verilog RTL. Results are stored in tar/ and hdl/ under the work directory.
TAPA_CONCURRENCY
Environment variable controlling the number of coroutine threads used during software simulation. Set to 1 to force sequential execution (useful for debugging). The default is the number of physical CPU cores on the host machine.
top-level task (upper-level task)
A task that only invokes other tasks via tapa::task().invoke() and contains no direct computation. A top-level task maps to a system-level wrapper in RTL that wires sub-task ports together. The top-level task is specified with --top on the tapa command line.
work directory
The directory where TAPA stores all intermediate artifacts between pipeline steps. Set with --work-dir. The default is work.out/ in the current directory. See Output Files for the full directory structure.
xclbin
Xilinx compiled binary. The final bitstream file produced by Vivado implementation. An xclbin is loaded onto the FPGA by the host application at runtime (via XRT or FRT). It is produced by running v++ --link on an XO file.
xo
Xilinx object file. The intermediate artifact produced by tapa pack, containing all per-task RTL and metadata in a ZIP archive. The XO is the input to v++ --link for bitstream generation, and is also passed as --bitstream to the host executable for fast hardware cosimulation.
Building from Source
This guide is for developers contributing to or extending TAPA, or advanced users building TAPA from source for custom OS support. For FPGA accelerator development with TAPA, install a published release and refer to the User Documentation.
If your OS isn't officially supported, consider using a virtual machine or file a feature request on GitHub.
System Prerequisites
To build TAPA from source, you need:
- Bazel 7.3.2 or later
- Binutils 2.30 or later
- Git
- Libstdc++ matching the most recent GCC version installed on your system
- Python 3.13 or later (Bazel fetches its own managed toolchain; this version applies to the Bazel-managed Python, not necessarily the host system Python)
- Other TAPA dependencies
Install these tools using your OS package manager. For Ubuntu:
# Install bazel
sudo apt-get install apt-transport-https ca-certificates gnupg
curl -fsSL https://bazel.build/bazel-release.pub.gpg \
| gpg --dearmor | sudo tee /usr/share/keyrings/bazel-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] \
https://storage.googleapis.com/bazel-apt stable jdk1.8" \
| sudo tee /etc/apt/sources.list.d/bazel.list
sudo apt-get update && sudo apt-get install bazel
# Install other tools
sudo apt-get install binutils git python3
For Bazel installation on other operating systems, see the Bazel documentation.
The Dockerfile in the TAPA repository provides a complete build environment. Use it for containerized builds or run the Ubuntu commands to install required tools.
Clone the Repository
Clone the repository from GitHub:
git clone https://github.com/tuna/tapa.git
If you are contributing to TAPA, fork the repository and clone your fork instead. When you're ready to contribute, create a new branch for your changes, commit your work, and open a pull request to contribute your changes back to the main repository.
Modify the Build Configuration
You may need to modify the VARS.bzl file in the repository's root directory
to specify the correct Vivado installation paths and versions. The build script
currently assumes default installation paths at
/opt/tools/xilinx/Vivado/2024.2 and
/opt/tools/xilinx/Vivado/2022.2 for Vivado, and
/opt/tools/xilinx/Vitis/2024.2 for Vitis.
If your Xilinx tools are installed in non-standard locations, please modify
the XILINX_TOOL_PATH variable to reference the correct base installation
directory for your Vivado and Vitis installations. You should also update
XILINX_TOOL_VERSION to specify the version of the latest Xilinx tools
you have installed. With these settings properly configured, the system
will expect your Vivado installation to be located at
{XILINX_TOOL_PATH}/Vivado/{XILINX_TOOL_VERSION}.
Furthermore, you should configure XILINX_TOOL_LEGACY_VERSION to indicate
the earliest version of Xilinx tools installed on your system, along with
XILINX_TOOL_LEGACY_PATH to point to the corresponding installation
directory.
If your system does not have the Xilinx Runtime (XRT) installed, set
the HAS_XRT variable in the VARS.bzl file to False. This prevents the
tests from failing due to the absence of XRT.
Build TAPA from Source
To build TAPA, navigate to the root directory of the cloned repository and execute the following command:
bazel build //...
This command compiles all TAPA targets, including the compiler, runtime library, and tests.
For building a specific target, replace //... with the desired target
name. For instance, to build only the TAPA compiler:
bazel build //tapa
To skip building the tests, use:
bazel build //... -- -//tests/...
After the build process completes, you can find the compiled binaries in the
bazel-bin directory. For example, the TAPA compiler binary is located at
bazel-bin/tapa/tapa.
The build process duration may vary depending on your system's performance. LLVM, a significant dependency used by TAPA for code generation, requires considerable time to build. Bazel will cache it after the initial build.
Use the Built TAPA
Once TAPA is built, you can use the compiled TAPA compiler to compile your designs. For example:
bazel-bin/tapa/tapa compile \
-f tests/apps/bandwidth/bandwidth.cpp \
--cflags -Itests/apps/bandwidth/ \
-t Bandwidth \
--clock-period 3 \
--part-num xcu250-figd2104-2L-e
Remember to rerun the bazel build command whenever you make changes to the
TAPA compiler or runtime library to ensure you're using the latest version.
Build the Documentation
The documentation is written in Markdown and built with mdBook. The Bazel build rules fetch the correct mdBook and mdbook-admonish binaries automatically — no separate install is needed.
Build a static HTML site:
bazel build //docs:build
The output is a tarball at bazel-bin/docs/book.tar.gz. Extract it to browse the HTML locally.
Serve with live reload during editing:
bazel run //docs:serve
This starts a local server (default: http://localhost:3000) that reloads automatically when source files change. Supported on Linux x86_64, macOS x86_64, and macOS arm64.
The documentation source lives under docs/src/. The Bazel targets handle mdbook-admonish preprocessing automatically; do not run mdbook-admonish install manually in the source tree.
Run TAPA Tests
To run all TAPA tests, including unit tests and integration tests, use the following command in the repository's root directory:
bazel test //...
For running a specific test, replace //... with the test name. For example,
to test only a specific app:
bazel test //tests/apps/vadd:vadd-xosim
Build Binary Distribution
To create a binary distribution of TAPA, navigate to the root directory of the cloned repository and execute the following command:
bazel build --config=release //:tapa-pkg-tar
Find the generated binary distribution in the bazel-bin directory,
as a tarball named tapa-pkg-tar.tar.
Install the Binary Distribution
To install the binary distribution, extract the tarball to a directory of your choice:
tar -xvf bazel-bin/tapa-pkg-tar.tar -C /path/to/install
Access the TAPA compiler binary at /path/to/install/usr/bin/tapa.
Containerized Build (Advanced)
For those who prefer a containerized build environment, TAPA offers a GitHub
Actions workflow that can be run locally using act. This approach ensures
a consistent build environment across different systems.
Prerequisites
- Install act by following the instructions in the act repository.
- Ensure Docker is installed on your system, as act requires it to run the workflow.
Configuration
Before running act, set up the following configuration files:
- Create a .secrets file in the repository root with the following content:
  UBUNTU_PRO_TOKEN=[YOUR_UBUNTU_PRO_TOKEN]
  MAC_ADDRESS=de:ed:be:ef:ca:fe
  Replace [YOUR_UBUNTU_PRO_TOKEN] with your Ubuntu Pro token (available free for personal use) and de:ed:be:ef:ca:fe with your Vivado license MAC address.
- Update the .actrc file in the repository root:
  --secret-file .secrets
- If your Vivado license and installation locations differ from the defaults (/share/software/licenses/xilinx-ci.lic and /share/software/tools respectively), update .github/actions/run-docker/action.yml accordingly.
Running Containerized Tests
To test TAPA in the containerized environment:
act -j test
This method often provides more consistent results than local testing due to the isolated environment. It also benefits from a shared Bazel cache between runs, potentially speeding up the build process.
Build artifacts are not saved to the local bazel-bin directory in
containerized builds. For debugging, you may need to build TAPA in your
local environment. However, you can still add test cases and use act
for testing your changes.
Creating a Binary Distribution
To create a binary distribution of TAPA:
act -j build
The resulting binary distribution is saved in the artifacts.out directory
in the repository root (e.g., artifacts.out/1/tapa/tapa.tar.gz for the
first build).
Installing the Binary Distribution
To install the binary distribution:
- Extract the tarball to your preferred directory, or
- Use the provided install.sh script to install TAPA to the default location:
  TAPA_LOCAL_PACKAGE=./artifacts.out/1/tapa/tapa.tar.gz ./install.sh
Developing TAPA
This section is intended for developers who want to contribute to TAPA. It explains the development process, the code structure, and the guidelines for contributing to the TAPA framework.
Development Environment
TAPA enforces a consistent coding style and provides tools to ensure code quality. Follow these steps to set up your development environment.
Install Pre-Commit Hooks
pip install pre-commit
pre-commit install
The latest version of pre-commit is required, which depends on a newer Python version. Some hooks may fail if your Python version is outdated.
Pre-commit hooks run automatically before each commit to ensure code compliance with style guidelines. To manually run the checks:
pre-commit run --all-files
Install Python Dependencies for IDEs
While Bazel automatically installs required Python dependencies during build and test, you can manually install them for IDE access:
pip install -r tapa/requirements_lock.txt
Setting C++ Compiler Options for IDEs
Generate a compile_commands.json file to configure your IDE with Bazel's
compiler options:
bazel run //:refresh_compile_commands
Code Structure
The TAPA codebase is organized into several key directories:
- bazel/: Contains Bazel build configurations. It defines how the TAPA compiler is used in the Bazel build system, and provides additional utilities for building and testing TAPA.
- docs/: Includes documentation files. The documentation is written in Markdown and built using mdBook.
- fpga-runtime/: Provides the FPGA runtime library. The FPGA runtime library interacts with a simulator or an FPGA, depending on the provided bitstream. It uses a fast, lightweight simulator for cosimulation with an XO object file, and the XRT library for Vitis simulation or on-board testing with an XCLBIN file.
- tapa-cpp/: Customizes the Clang C++ preprocessor for TAPA. The TAPA C++ preprocessor preprocesses TAPA C++ code before passing it to the tapacc compiler. It supports TAPA-specific features, such as [[tapa::pipeline]] annotations (which map to the Vitis HLS PIPELINE pragma) and [[tapa::unroll]] annotations (which map to the Vitis HLS UNROLL pragma).
- tapa-lib/: Houses the TAPA runtime library. The TAPA runtime library provides core functionality for TAPA tasks, streams, and memory maps. It implements platform-specific features (e.g., software simulation queues, hardware FIFOs).
- tapa-llvm-project/: Contains the LLVM project with TAPA-specific patches (fetched as an external Bazel dependency, not checked in to the repository). TAPA uses LLVM Clang to generate the system interconnect and transformed C++ code for each task. The LLVM project is customized with TAPA-specific features, such as C++ annotations.
- tapa-system-include/: Creates a custom system include directory for TAPA. This Bazel build target collects system include files for the tapa-cpp and tapacc compilers. It includes standard C++ headers, TAPA dependencies, and TAPA-specific headers so that the compilers can run on every OS.
- tapa/: Contains the core TAPA compiler and runtime library. The TAPA compiler serves as the entry point for the TAPA framework. It invokes the tapa-cpp and tapacc compilers, synthesizes tasks into RTL using HLS tools, and generates the system interconnect and an XO object file for the FPGA. For the xilinx-hls target, a .zip RTL archive is generated instead.
- tapacc/: Implements the TAPA C++ compiler that translates TAPA tasks to JSON. The TAPA C++ compiler is a Clang-based compiler for TAPA tasks. It analyzes tasks and streams, generating a JSON representation of the tasks and dataflow.
- tests/: Includes test cases for the TAPA compiler and runtime library. The folder includes various TAPA applications: microbenchmarks under apps/ for basic functionality testing, and regression/ for performance evaluation of TAPA-compiled designs.
Update Dependencies
TAPA depends on several external libraries and tools. This section explains how to update these dependencies.
General Version Bump Process
When bumping versions, follow this general workflow:
- Clear existing lock files.
- Update dependency declarations.
- Regenerate lock files.
- Test the build.
- Commit changes.
Bazel Dependencies
For Bazel dependencies:
- Update the version numbers in MODULE.bazel.
- Check the Bazel Central Registry for latest versions, and update the bazel_dep entries in MODULE.bazel accordingly.
- Remove MODULE.bazel.lock to force regeneration.
For Python and Node.js toolchains in MODULE.bazel:
# Update Python version
python.toolchain(
python_version = "3.13.2", # Update version here
...
)
use_repo(python, python_3_13 = "python_3_13_2") # Update repo name too
# Update Python version in pip declaration
pip.parse(
python_version = "3.13.2", # Update version here
...
)
# Update Node.js version
node.toolchain(node_version = "17.9.1")
Python Dependencies
To update Python packages:
# Clear existing lock file
echo > tapa/requirements_lock.txt
# Update the dependencies
bazel run //tapa:requirements.update
This will regenerate the requirements_lock.txt file with the latest
compatible versions.
XRT Dependency
For XRT (Xilinx Runtime):
- Check the XRT GitHub releases for latest versions.
- Update the version and SHA256 checksum in MODULE.bazel:
  XRT_VERSION = "202420.2.18.179"  # Update version
  XRT_SHA256 = "..."  # Update SHA256 checksum
- Calculate SHA256 checksum with:
  curl -L https://github.com/Xilinx/XRT/archive/refs/tags/{VERSION}.tar.gz | sha256sum
LLVM Version Updates
To update the LLVM version:
- Find the latest stable release of LLVM on LLVM GitHub releases.
- Update the version numbers in MODULE.bazel:
  LLVM_VERSION_MAJOR = 20
  LLVM_VERSION_MINOR = 1
  LLVM_VERSION_PATCH = 4
- Update the SHA256 checksum after downloading the new version:
  LLVM_SHA256 = "<new_sha256_checksum>"
Docker Images
For the Docker testing and building environments:
- Update the base image versions in .github/docker/*.
- Update the system dependencies trigger date to the current date, so that the Docker image is rebuilt with the latest system dependencies:
  RUN apt-get update && \
      # Update the following line to the latest date for retriggering the docker build
      echo "Installing system dependencies as of 20250505" && \
      apt-get upgrade -y
Pre-commit Hooks
Update pre-commit hooks to the latest versions:
pre-commit autoupdate
Verifying Updates
After updating dependencies:
- Remove the lock file:
  rm MODULE.bazel.lock
- Run a full build:
  bazel build //...
- Run the pre-commit checks:
  pre-commit run --all-files
- Commit the changes:
  git commit -a -m "build(deps): bump versions"
This section provides guidance on updating all types of dependencies in the TAPA project, including where to find the latest versions and how to verify that the updates work correctly.
Contributing to TAPA
Pull Request Process
- Fork the TAPA repository and create a new branch for your feature or bug fix.
- Ensure all tests pass and pre-commit hooks run successfully.
- Write a clear and concise description of your changes in the pull request.
- Request a review from the TAPA maintainers.
Continuous Integration
TAPA uses GitHub Actions for continuous integration. The CI pipeline:
- Builds binary distributions on Ubuntu 18.04 self-hosted runners.
- Performs code quality checks using pre-commit hooks on every commit.
- Runs functional and integration tests via staging workflows across a matrix of platforms and Vitis versions for every main branch push.
Documentation
- Update the documentation in the docs/ directory for any new features or changes.
- Use Markdown format for documentation files.
- Run the following command in the docs/ directory to build and preview documentation changes locally:
  bash build.sh
Testing
- Add appropriate unit tests for new features or bug fixes.
- Ensure all existing tests pass before submitting your changes.
- Run the full test suite using the following command:
  bazel test //...
Reporting Issues
- Use the GitHub issue tracker to report bugs or suggest new features.
- Provide a clear and concise description of the issue or feature request.
- Include steps to reproduce the issue, if applicable.
- Attach relevant log files or screenshots, if available.
Community Guidelines
- Be respectful and considerate in all interactions with other contributors.
- Provide constructive feedback on pull requests and issues.
Releasing TAPA Builds
This section explains how to release TAPA builds. It is intended for maintainers with write access to the TAPA repository.
Automated Release Process
Releases are automated via GitHub Actions. The publish-release.yml
workflow builds and publishes a release to GitHub Releases.
To create a release:
- Update the VERSION file on main with the desired version string (e.g. 0.1.20260319).
- Trigger the Publish Release workflow via workflow_dispatch from the GitHub Actions UI. Optionally override the version in the input field; if left blank, the contents of the VERSION file are used.
The workflow will:
- Build the release tarball on a self-hosted runner
- Create the git tag v<version> on main
- Publish tapa.tar.gz and tapa-visualizer.tar.gz to GitHub Releases
Staging Builds
Every push to main triggers the staging-build.yml workflow, which
runs the full test matrix across all supported OS and Vitis version
combinations. Staging builds are uploaded as workflow artifacts (retained
for 7 days) but are not published as releases.
Installing a Release
Users can install a published release with:
curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q
To install a specific version by tag:
curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | TAPA_VERSION=x.y.z sh -s -- -q
To install from a local release tarball:
TAPA_LOCAL_PACKAGE=./tapa.tar.gz ./install.sh -q