Welcome

TAPA is a task-parallel HLS framework that compiles C++ dataflow programs to Verilog RTL for Xilinx FPGAs. Its software simulation runs on the CPU and requires no FPGA hardware.

C++ source → tapa compile → RTL (.xo) → Vitis v++ → FPGA bitstream

Choose your path

New to FPGA? → Your First Run
Migrating from Vitis HLS? → Lab 3: Migrating from Vitis HLS
Already know FPGA? → How-To Guides (start with software simulation)

Installation

Purpose: Install TAPA on your development machine.

When to use this: Setting up TAPA for the first time.

What you need

Dependency	Version	Notes
GNU C++ Compiler (`g++`)	7.5.0 or newer	Required for software simulation and deployment
Xilinx Vitis	2022.1 or newer	Not needed for software simulation — only required for RTL synthesis and deployment

TAPA has been tested on the following operating systems:

OS	Minimum version	Notes
Ubuntu	18.04
Debian	10
Red Hat Enterprise Linux	9	Derivatives (AlmaLinux 9+, Rocky Linux 9+) also supported
Amazon Linux	2023
Fedora	34	Fedora 39+ may have minor issues due to C library changes and Vitis HLS incompatibility

Install from release

curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q

This installs the current stable release (0.1.20260319). With root privileges, TAPA installs to /opt/tapa with symlinks in /usr/local/bin. Otherwise it installs to ~/.tapa and adds itself to your PATH via your shell profile.

Rust migration in progress

TAPA's internal toolchain is being incrementally refactored to Rust for improved performance and reliability. During this transition, we recommend staying on the stable release (0.1.20260319) for production workloads. To try the latest (potentially unstable) release instead, pass --beta:

curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q --beta

To install a specific version:

curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh \
  | TAPA_VERSION=0.1.20260319 sh -s -- -q

Releases are available at github.com/tuna/tapa/releases.

Install g++

Install g++ using the package manager for your OS.

Ubuntu / Debian

For Ubuntu 18.04 and newer, or Debian 10 and newer:

sudo apt-get install g++

RHEL / Amazon Linux

For Red Hat Enterprise Linux 9 and newer, its derivatives (AlmaLinux 9 and newer, Rocky Linux 9 and newer), or Amazon Linux 2023:

sudo yum install gcc-c++ libxcrypt-compat

Fedora

For Fedora 34 and newer. Fedora 39 and newer may have minor issues due to system C library changes and Vitis HLS tool incompatibility.

sudo yum install gcc-c++ libxcrypt-compat

Verify installation

tapa --version

Building from source

For source builds (full toolchain requirements and build commands), see Building from Source.

Warning

If installation fails, see Common Errors for known issues.

Next step: Your First Run

Your First Run

Run your first TAPA software simulation without FPGA hardware.

When to use this

Use this guide when you are learning TAPA for the first time, or when you want to quickly verify a design's correctness without synthesizing RTL or running on physical hardware.

What you need

TAPA installed — see Installation
g++ 7.5.0 or newer (check with g++ --version)
The vadd example files: vadd.cpp and vadd-host.cpp

Commands

Compile the kernel and host code together using the tapa g++ wrapper, then run the resulting binary with no arguments to trigger software simulation:

tapa g++ -- vadd.cpp vadd-host.cpp -o vadd
./vadd

Note

tapa g++ is a wrapper around the GNU C++ compiler that automatically includes the necessary TAPA headers and libraries. It prints the underlying g++ command it invokes for reference.

Both the kernel file (vadd.cpp) and the host file (vadd-host.cpp) must be passed in the same command, because software simulation compiles the kernel code into the same executable as the host code.

Expected output

I20000101 00:00:00.000000 0000000 task.h:66] running software simulation with TAPA library
kernel time: 1.19429 s
PASS!

What this proves

The PASS! line confirms the vector addition produced correct results. The first line shows that TAPA executed the kernel on the CPU using its coroutine-based software simulator — no FPGA or Xilinx tools were involved.
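
A PASS!/FAIL! verdict of this kind usually comes from a host-side comparison against a CPU reference. The sketch below is a plain C++ illustration of such a check. The helper name CountMismatches is hypothetical and not part of TAPA, and the real vadd-host.cpp may verify its results differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of TAPA): counts the elements where
// c[i] != a[i] + b[i]. If the three vectors differ in length, the longest
// length is returned so that the result can never look like a pass.
uint64_t CountMismatches(const std::vector<float>& a,
                         const std::vector<float>& b,
                         const std::vector<float>& c) {
  if (a.size() != c.size() || b.size() != c.size()) {
    return std::max({a.size(), b.size(), c.size()});
  }
  uint64_t mismatches = 0;
  for (std::size_t i = 0; i < c.size(); ++i) {
    if (c[i] != a[i] + b[i]) ++mismatches;
  }
  return mismatches;
}
```

A host program could print PASS! when the count is zero and FAIL! otherwise.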

If something goes wrong

If the build fails, the binary hangs, or the output shows FAIL!, see Your First Debug Cycle.

Next step

The Programming Model

Your First Debug Cycle

Diagnose and fix failures in TAPA software simulation.

Prerequisites

TAPA installed — see Installation
A simulation binary built with tapa g++ — see Your First Run

Symptom

The simulation hangs without producing output, crashes with an error, or prints FAIL! instead of PASS!.

How to confirm: run single-threaded

By default TAPA runs each task in its own coroutine using a thread pool sized to the number of physical CPU cores. Reducing concurrency to one thread improves reproducibility and simplifies debugging:

TAPA_CONCURRENCY=1 ./vadd

If the hang disappears or a crash becomes reproducible, the problem is likely a race condition or a deadlock that only manifests under concurrent execution.

Fix patterns

Attach GDB

Software simulation runs as a normal CPU process, so a debugger works without any special setup:

gdb ./vadd

Set a breakpoint on any TAPA task function by name and run:

(gdb) b VecAdd
(gdb) run

You can set breakpoints on any leaf task (Add, Mmap2Stream, Stream2Mmap, etc.) and step through the code exactly as you would for a regular C++ program.

Dump stream contents

Set TAPA_STREAM_LOG_DIR to a directory path before running. TAPA will write one log file per named stream under that directory, recording every value written to the stream:

TAPA_STREAM_LOG_DIR=/tmp/logs ./vadd

Log format:

Primitive types (int, float, …) are written as decimal text, one value per line.
Structs without operator<< are written as little-endian hex.
Structs with operator<< are written using your operator.

After the run, inspect the files under /tmp/logs/ to trace data as it flows through each stream and locate where incorrect values first appear.
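
Since structs with operator<< are logged through that operator, defining one can make stream logs far easier to read than raw hex. The following is a minimal sketch in plain C++. The Packet struct, its fields, and the FormatForLog helper are hypothetical names used only for illustration; the helper merely shows the text the operator produces and is not how TAPA itself writes log files.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical element type carried by a stream.
struct Packet {
  uint32_t addr;
  float payload;
};

// User-defined operator<<; per the log format rules, a struct that has one
// is written to the log using this operator.
std::ostream& operator<<(std::ostream& os, const Packet& p) {
  return os << "addr=" << p.addr << " payload=" << p.payload;
}

// Illustration only: renders a value the same way the operator would.
std::string FormatForLog(const Packet& p) {
  std::ostringstream oss;
  oss << p;
  return oss.str();
}
```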

Common mistakes to check

Symptom	Likely cause	Fix
Hangs forever	Deadlock or backpressure — a stream is full or empty and no task can make progress	Deadlocks & Hangs
Wrong output (`FAIL!`)	Logic error in a leaf task	Attach GDB or dump stream contents (above)
Build fails with template errors	Pass-by-value/reference mismatch on streams or mmaps	Common Errors

Tip

Always pass your design through software simulation before attempting RTL synthesis or hardware simulation. Software simulation compiles in seconds, and standard tools like GDB and AddressSanitizer work without modification.

To catch memory errors, compile with sanitizers:

tapa g++ -- vadd.cpp vadd-host.cpp -fsanitize=address -g -o vadd

Next step

Full FPGA Compilation

Compile a TAPA design to an FPGA bitstream and run it on hardware.

When to use this

Use this guide after software simulation passes (see Your First Run) and you are ready to target real hardware or run a more accurate RTL-level simulation.

What you need

TAPA installed — see Installation
Xilinx Vitis 2022.1 or newer
A compatible Alveo platform (the examples below use the U250)
The vadd source files: vadd.cpp and vadd-host.cpp

Stage 1 — Synthesize to RTL

Run tapa compile to translate the C++ kernel into an RTL object (.xo):

tapa \
  compile \
  --top VecAdd \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vecadd.xo

Flag	Meaning
`--top`	Name of the top-level TAPA task
`--part-num`	Target FPGA part number
`--clock-period`	Target clock period in nanoseconds
`-f`	Kernel source file
`-o`	Output XO file
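
As a sanity check on --clock-period, the target frequency in MHz is 1000 divided by the period in nanoseconds, so 3.33 ns corresponds to roughly 300 MHz. The helper below is a hypothetical convenience function for that arithmetic, not part of TAPA.

```cpp
// Converts a clock period in nanoseconds to a frequency in MHz: f = 1000 / T.
// Hypothetical helper for illustration; the caller must pass a positive period.
double ClockPeriodNsToMHz(double period_ns) {
  return 1000.0 / period_ns;
}
```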

Note

You can replace --part-num and --clock-period with --platform to target a Vitis platform directly, for example:

--platform xilinx_u250_gen3x16_xdma_4_1_202210_1

HLS reports are written to work.out/report/ after synthesis completes.

Artifact produced: vecadd.xo

Stage 2 — Fast hardware simulation

Before waiting hours for a full bitstream, validate the RTL with TAPA's fast cosimulation. Pass the .xo file as the --bitstream argument:

./vadd --bitstream=vecadd.xo 1000

Fast cosim uses simplified models for external components (DRAM, AXI interconnect) so setup takes only a few seconds instead of the ten-plus minutes that Vitis cosimulation requires. A successful run prints PASS!.

Note

The default simulator backend is xsim, which requires Vivado on Linux. To use Verilator instead (cross-platform, no Vivado required), pass -cosim_simulator verilator to the host executable: ./vadd --bitstream=vecadd.xo -cosim_simulator verilator.

Stage 3 — Link to xclbin

Use Vitis v++ to link the .xo into a hardware bitstream. This step does not involve TAPA and typically takes several hours:

v++ -o vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin \
  --link \
  --target hw \
  --kernel VecAdd \
  --platform xilinx_u250_gen3x16_xdma_4_1_202210_1 \
  vecadd.xo

Artifact produced: vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin

Warning

Hardware binary generation typically takes several hours. Plan accordingly, and ensure your machine will remain available for the full duration.

Stage 4 — On-board execution

With an Alveo card installed and XRT configured, run the host binary and point it at the generated xclbin:

./vadd --bitstream=vadd.xilinx_u250_gen3x16_xdma_4_1_202210_1.hw.xclbin

A successful on-board run prints PASS!, confirming the accelerator produced correct results on real hardware.

Next step

The Programming Model

Purpose: Understand the TAPA task-parallel programming model.

Prerequisites: Installation

TAPA bridges familiar sequential C++ to FPGA hardware parallelism. Rather than requiring users to write RTL directly, it lets them express computation as a graph of concurrently-running tasks communicating through typed streams and shared memory interfaces.

Why this exists

Writing FPGA accelerators traditionally requires either low-level RTL or fragile HLS pragmas that break when code is refactored. TAPA solves this by letting you describe the parallel structure of your design as a graph of C++ functions. The compiler turns that graph into RTL automatically, while the same code runs natively on a CPU for simulation. You get the productivity of C++ without giving up the ability to express fine-grained, concurrent hardware pipelines.

Mental model

A TAPA design is a directed graph of tasks connected by streams and memory interfaces. Scalars are passed as function arguments.

Host
 │  tapa::invoke(TopTask, bitstream, mmap_args...)
 ▼
Top-level task  ← no computation; spawns all leaf tasks
 ├── spawns ──> Leaf task A  (writes to stream S)
 │                            stream S
 ├── spawns ──> Leaf task B  (reads stream S, writes to stream T)
 │                            stream T
 └── spawns ──> Leaf task C  (reads stream T, writes to mmap)
                              mmap ──> DRAM

The host calls tapa::invoke, passing the kernel function, a bitstream path (empty for software simulation), and the kernel arguments.
The top-level task is the entry point synthesized by tapa compile. It declares streams as local objects, then spawns all leaf tasks and passes streams to them by reference. It contains no computation of its own.
Leaf tasks perform the actual computation. One leaf writes to a stream; another reads from it. Streams flow between leaf tasks — the top-level task is never the producer or consumer of stream data.

All child tasks spawned by tapa::task().invoke(...) run concurrently. The top-level task returns only after every child task has finished.

Minimal correct example

Kernel file (`vadd.cpp`)

The top-level task VecAdd declares three streams, then launches four leaf tasks that run in parallel:

void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a");
  tapa::stream<float> b_q("b");
  tapa::stream<float> c_q("c");

  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}

Host file (`vadd-host.cpp`)

The host calls tapa::invoke with the kernel function, the bitstream path, and the kernel arguments. When the bitstream path is empty (the default), TAPA runs software simulation:

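#include <cstdint>
#include <cstdlib>
#include <vector>

#include <gflags/gflags.h>
#include <tapa.h>

// Top-level task, defined in vadd.cpp.
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n);

DEFINE_string(bitstream, "", "Path to XO or xclbin file. Empty = software simulation.");

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  // Element count from the first positional argument (assumed here);
  // the default of 1024 is chosen only for illustration.
  const uint64_t n = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 1024;
  std::vector<float, tapa::aligned_allocator<float>> a(n), b(n), c(n);
  // ... fill a and b ...

  int64_t kernel_time_ns = tapa::invoke(
      VecAdd, FLAGS_bitstream,
      tapa::read_only_mmap<const float>(a),
      tapa::read_only_mmap<const float>(b),
      tapa::write_only_mmap<float>(c),
      n);
  // ... check c and report kernel_time_ns ...
  return 0;
}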

The --bitstream flag is what controls which backend runs:

Omitted or empty → software simulation
.xo → fast cosimulation
.hw.xclbin → on-board execution

Rules

Host code and kernel code must live in separate files. The kernel file is compiled to RTL; the host file is compiled to a CPU executable.
The kernel file must contain exactly one top-level task — the function passed as --top to tapa compile.
The top-level task is called via tapa::invoke from the host; it is never called directly.
An upper-level task body must contain only stream declarations, tapa::task().invoke(...) chains, and scalar/mmap argument forwarding — no computation.
Streams are passed by reference (tapa::istream<T>&, tapa::ostream<T>&). Passing streams by value is a compile error.
mmap arguments are passed by value (tapa::mmap<T>), not by reference.
Scalar arguments (plain C++ types such as int, float, uint64_t) are passed by value and are read-only to the kernel. The kernel cannot communicate a result back to the host through a scalar parameter; use an mmap or stream instead.
Software simulation runs automatically when tapa::invoke receives an empty bitstream path.
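
The rule about scalar arguments follows ordinary C++ pass-by-value semantics. The sketch below is a plain C++ analogy, not TAPA code: KernelLike and CallKernelLike are hypothetical names, and the float pointer merely plays the role that an mmap plays in a real kernel.

```cpp
#include <cstdint>
#include <vector>

// Plain C++ stand-in for a kernel. `n` is a by-value copy; `out` points at
// caller-owned memory (the role an mmap plays in TAPA).
void KernelLike(uint64_t n, float* out) {
  out[0] = static_cast<float>(n);  // visible to the caller through memory
  n = 0;                           // changes only the local copy
}

// Calls the stand-in kernel and returns the caller's n afterwards,
// showing that the caller's scalar is unchanged.
uint64_t CallKernelLike(uint64_t n, std::vector<float>& buffer) {
  KernelLike(n, buffer.data());
  return n;
}
```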

Common mistakes

Wrong: calling the top-level task directly from host code

// WRONG — bypasses the TAPA runtime entirely; streams are not initialized,
// hardware execution cannot be dispatched.
VecAdd(tapa::mmap<const float>(a.data()), /* ... */);

Right: always use `tapa::invoke`

// RIGHT — works for software simulation, cosim, and on-board execution.
tapa::invoke(VecAdd, FLAGS_bitstream,
             tapa::read_only_mmap<const float>(a),
             tapa::read_only_mmap<const float>(b),
             tapa::write_only_mmap<float>(c),
             n);

tapa::invoke examines the bitstream path at runtime and dispatches to the correct backend: software simulation (empty path), RTL co-simulation (.xo), emulation (.hw_emu.xclbin), or on-board execution (.hw.xclbin).
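
The dispatch rule above can be modelled as a simple classification of the path. This is a sketch under stated assumptions: the Backend enum and ClassifyBitstream function are hypothetical, and matching on the file-name suffix is an assumption about how the path is examined, not a description of TAPA's implementation.

```cpp
#include <string>

// Hypothetical names for the four backends described above.
enum class Backend {
  kSoftwareSimulation,
  kFastCosim,
  kHardwareEmulation,
  kOnBoard,
  kUnknown
};

// True if `text` ends with `suffix`.
bool EndsWith(const std::string& text, const std::string& suffix) {
  return text.size() >= suffix.size() &&
         text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Maps a bitstream path to a backend, assuming the decision is suffix-based.
Backend ClassifyBitstream(const std::string& path) {
  if (path.empty()) return Backend::kSoftwareSimulation;
  if (EndsWith(path, ".xo")) return Backend::kFastCosim;
  if (EndsWith(path, ".hw_emu.xclbin")) return Backend::kHardwareEmulation;
  if (EndsWith(path, ".hw.xclbin")) return Backend::kOnBoard;
  return Backend::kUnknown;
}
```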

The Compile Pipeline

Purpose: Understand the three-stage TAPA compile pipeline.

Prerequisites: The Programming Model

Each tapa subcommand maps to one pipeline stage. Knowing the stages helps diagnose failures, parallelize synthesis, and use remote execution correctly.

Why this exists

Compiling a TAPA design involves three distinct concerns: parsing C++ and extracting the task graph, synthesizing each task to RTL with Vitis HLS, and packaging the RTL into an .xo file for Vitis. Separating these stages lets you re-run only the parts that changed, run synthesis on a remote machine with Xilinx tools, and parallelize synthesis across tasks.

Mental model

C++ source
    │
    ▼  tapa analyze  (always local)
task graph JSON
    │
    ▼  tapa synth    (can run remotely, parallelizable with -j)
per-task RTL (Verilog)
    │
    ▼  tapa pack     (can run remotely)
.xo file
    ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌ (TAPA boundary) ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
    │
    ▼  v++ --link    (Vitis, not TAPA)
.xclbin

tapa analyze — Runs tapa-cpp and tapacc locally. Reads your C++ source, resolves task boundaries, and writes a task graph JSON to the work directory. No vendor tools are required for this step.

tapa synth — Invokes Vitis HLS for each task to produce per-task Verilog RTL. This is the most time-consuming step. With -j N, up to N tasks are synthesized in parallel. With --remote-host, synthesis runs on a remote Linux machine that has Vitis HLS installed.

tapa pack — Combines the per-task RTL into a single Xilinx IP package (.xo file) suitable for v++ --link.

Shortcut: tapa compile runs all three stages in the correct order in a single command.

Minimal correct example

All-in-one (most common)

tapa compile \
  --top VecAdd \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vecadd.xo

Use --platform instead of --part-num when targeting a full Vitis platform:

tapa compile \
  --top VecAdd \
  --platform xilinx_u250_gen3x16_xdma_4_1_202210_1 \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vecadd.xo

Running stages separately

--work-dir is a top-level tapa flag that applies to all subcommands. It must be the same across all three stages when running them separately (default: work.out).

Run tapa analyze first to extract the task graph (no vendor tools needed):

tapa --work-dir work.out analyze \
  --top VecAdd \
  -f vadd.cpp

Then run tapa synth to synthesize each task to RTL, optionally in parallel and/or on a remote host:

tapa --work-dir work.out synth \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -j 4

Finally, run tapa pack to produce the .xo file:

tapa --work-dir work.out pack \
  -o vecadd.xo

Rules

tapa analyze always runs locally, even when --remote-host is set.
tapa synth and tapa pack run on the remote host when --remote-host is provided.
tapa compile is the shortcut for all three stages and handles stage ordering automatically.
The -j / --jobs flag on tapa synth controls how many Vitis HLS processes run in parallel. Keep it at or below the available core count on the synthesis machine.
--work-dir is a top-level flag: tapa --work-dir DIR <subcommand>.

Common mistakes

Wrong: running `tapa synth` before `tapa analyze`

# WRONG — the task graph JSON does not exist yet; tapa synth will fail
# with a missing file error.
tapa --work-dir work.out synth --part-num xcu250-figd2104-2L-e --clock-period 3.33

Right: always run `tapa analyze` first, or use `tapa compile`

# RIGHT — explicit ordering
tapa --work-dir work.out analyze --top VecAdd -f vadd.cpp
tapa --work-dir work.out synth   --part-num xcu250-figd2104-2L-e --clock-period 3.33
tapa --work-dir work.out pack    -o vecadd.xo

# RIGHT — shortcut that handles ordering automatically
tapa compile --top VecAdd --part-num xcu250-figd2104-2L-e \
             --clock-period 3.33 -f vadd.cpp -o vecadd.xo

Note about v++ link

Note

The v++ --link step that produces .xclbin is performed by Xilinx Vitis, not TAPA. TAPA's output is the .xo file. See Build & Run on Board for the full linking workflow.

Tasks

Purpose: Understand TAPA's three task types and their constraints.

Prerequisites: The Programming Model

Why this exists

TAPA organizes an FPGA accelerator as a hierarchy of C++ functions called tasks. This hierarchy lets the compiler assign each leaf task to an independent HLS module synthesized in parallel, while upper-level tasks provide the wiring between those modules. The result is a design whose parallel structure is explicit in the source code rather than inferred from pragmas.

Mental model

A TAPA design forms a tree of tasks:

Top-level task (entry point, kernel boundary)
├── Upper-level task (orchestration only)
│   ├── Leaf task A (computation)
│   └── Leaf task B (computation)
└── Leaf task C (computation)

Each level has a distinct role:

Leaf task — performs computation: loops, arithmetic, stream reads/writes. May call ordinary C++ functions. Must NOT invoke other TAPA tasks.
Upper-level task — orchestrates execution. Its body may only instantiate streams and invoke child tasks with tapa::task().invoke(...). It contains no computation of its own.
Top-level task — the kernel entry point invoked from the host via tapa::invoke. For the xilinx-vitis target (the default), the top-level task must itself be an upper-level task.

Minimal correct example

The VecAdd function from the vector-add example is a top-level upper-level task. It instantiates three streams, then invokes four leaf tasks:

void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a");
  tapa::stream<float> b_q("b");
  tapa::stream<float> c_q("c");

  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}

Mmap2Stream, Add, and Stream2Mmap are leaf tasks that each perform a specific computation. VecAdd contains no computation — only stream declarations and .invoke(...) calls.

Detached tasks

By default a parent task waits for all child tasks to finish before it terminates. A detached task is instead left running; the parent does not wait for it. This is useful for purely data-driven tasks that have no natural termination point (e.g., a constant data source or an infinite switch network).

tapa::task().invoke<tapa::detach>(LeafTask, arg1, arg2);

Detached tasks are similar to std::thread::detach in the C++ standard library. Because their state does not need to be tracked, they avoid fan-out termination signals and reduce area.

Note

By default, TAPA tasks are joined: the parent waits for each child to complete. Use tapa::detach only when the child task genuinely does not need to terminate on program completion.

Rules

Leaf tasks receive streams by reference (istream<T>&, ostream<T>&) and mmap interfaces by value (mmap<T>).
An upper-level task body must contain only stream instantiations and .invoke(...) calls — no loops, arithmetic, or other computation.
async_mmap channel operations (read_addr, read_data, etc.) are leaf-task-only.
For the xilinx-vitis target (the default), the top-level task must be an upper-level task — it cannot be a leaf task.
Leaf templated tasks (template functions that compute directly) are supported. Non-leaf templated tasks that invoke other tasks are not yet supported.

Common mistakes

Wrong — computation inside an upper-level task body:

// Wrong: for loop makes this a leaf task, not an upper-level task
void BadUpper(tapa::mmap<float> mem, uint64_t n) {
  tapa::stream<float> q("q");
  for (uint64_t i = 0; i < n; ++i) {  // <-- computation here
    q.write(mem[i]);
  }
  tapa::task().invoke(Consumer, q, n);
}

Right — move computation into a dedicated leaf task:

void Loader(tapa::mmap<float> mem, uint64_t n, tapa::ostream<float>& q) {
  for (uint64_t i = 0; i < n; ++i) {
    q.write(mem[i]);
  }
}

void GoodUpper(tapa::mmap<float> mem, uint64_t n) {
  tapa::stream<float> q("q");
  tapa::task()
      .invoke(Loader, mem, n, q)
      .invoke(Consumer, q, n);
}

Streams

Purpose: Communicate between TAPA tasks using typed FIFO streams.

Prerequisites: Tasks

Why this exists

Streams are the primary inter-task communication mechanism in TAPA. They are typed, directional FIFOs that appear explicitly in task signatures, making data flow visible in the source code. Unlike shared memory, streams enforce a single-writer/single-reader discipline and make producer–consumer relationships unambiguous to both the programmer and the compiler.

Mental model

A stream instance lives in an upper-level task. Leaf tasks receive directional references to it:

// Leaf task signatures use directional references
// (declared first so that Upper can refer to them)
void Producer(tapa::ostream<float>& out) { /* ... */ }
void Consumer(tapa::istream<float>& in)  { /* ... */ }

// Upper-level task instantiates the stream and wires it to two leaf tasks
void Upper(/* ... */) {
  tapa::stream<float, 16> data_q("data_q");  // depth = 16 elements

  tapa::task()
      .invoke(Producer, data_q)   // Producer writes to data_q
      .invoke(Consumer, data_q);  // Consumer reads from data_q
}

The Depth parameter of stream<T, Depth> controls the hardware FIFO depth (default: 2). A larger depth reduces the chance of stalls at the cost of more FPGA resources; the Rules section describes how depth maps to SRL, BRAM, and URAM.

Blocking read and write

void Task(tapa::istream<int>& in, tapa::ostream<int>& out) {
  int data = in.read();   // blocks until data is available
  out.write(data);        // blocks until space is available
}

The << and >> operator aliases are equivalent:

out << data;   // same as out.write(data)
in >> data;    // same as data = in.read()

Non-blocking read and write

To read from multiple streams or achieve an initiation interval of one, use the non-blocking variants that return a bool indicating success:

void Task(tapa::istream<int>& in, tapa::ostream<int>& out) {
  int data;
  bool ok = in.try_read(data);   // returns false if stream is empty
  if (ok) {
    // try_write returns false if the stream is full; a real task should
    // retry in that case, otherwise the value read above is lost.
    out.try_write(data);
  }
}

Readiness checks

Check stream state before committing to a read or write:

if (!in.empty())  { /* safe to read  */ }
if (!out.full())  { /* safe to write */ }

For non-destructive inspection, peek returns the front element and a validity flag without consuming it:

bool valid;
auto val = in.peek(valid);   // does not remove the token
if (valid && /* routing decision */) {
  in.read(nullptr);          // consume now
}
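
To make the non-blocking and readiness semantics concrete, here is a plain C++ model of a bounded FIFO. It is a sketch only: BoundedFifo is a hypothetical class, not tapa::stream, and its try_peek signature is chosen for simplicity rather than copied from the TAPA API.

```cpp
#include <cstddef>
#include <deque>

// Plain C++ model of a bounded FIFO; NOT the tapa::stream class.
template <typename T, std::size_t Depth = 2>
class BoundedFifo {
 public:
  bool empty() const { return items_.empty(); }
  bool full() const { return items_.size() >= Depth; }

  // Non-blocking write: returns false and stores nothing when full.
  bool try_write(const T& value) {
    if (full()) return false;
    items_.push_back(value);
    return true;
  }

  // Non-blocking read: returns false and leaves `value` untouched when empty.
  bool try_read(T& value) {
    if (empty()) return false;
    value = items_.front();
    items_.pop_front();
    return true;
  }

  // Non-destructive peek: copies the front element without consuming it.
  bool try_peek(T& value) const {
    if (empty()) return false;
    value = items_.front();
    return true;
  }

 private:
  std::deque<T> items_;
};
```

With a depth of 2, a third try_write fails until a read frees a slot, which mirrors the backpressure a full hardware FIFO applies to its writer.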

End-of-Transaction (EoT)

A producer signals the end of a data stream by calling close(). The consumer detects it with try_eot():

// Producer
void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
                 tapa::ostream<float>& stream) {
  for (uint64_t i = 0; i < n; ++i) {
    stream.write(mem[i]);
  }
  stream.close();  // send EoT token
}

// Consumer
void Stream2Mmap(tapa::istream<float>& stream, tapa::mmap<float> mem) {
  for (uint64_t i = 0;;) {
    bool eot;
    if (stream.try_eot(eot)) {
      if (eot) break;
      mem[i++] = stream.read(nullptr);
    }
  }
}
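
The EoT protocol can be pictured as an extra marker token that travels in the FIFO after the last data value. The following sequential model in plain C++ illustrates the idea. Token, ProduceWithEot, and ConsumeUntilEot are hypothetical names, the FIFO here is unbounded, and how TAPA encodes EoT in hardware is not covered.

```cpp
#include <deque>
#include <vector>

// One FIFO slot: either a data value or an end-of-transaction marker.
struct Token {
  float value;
  bool eot;
};

// Producer side: every data value, followed by a single EoT token
// (the counterpart of calling close()).
void ProduceWithEot(const std::vector<float>& data, std::deque<Token>& fifo) {
  for (float x : data) fifo.push_back(Token{x, false});
  fifo.push_back(Token{0.0f, true});
}

// Consumer side: drains tokens until the EoT marker arrives, so it does not
// need to know the element count in advance.
std::vector<float> ConsumeUntilEot(std::deque<Token>& fifo) {
  std::vector<float> received;
  while (!fifo.empty()) {
    Token token = fifo.front();
    fifo.pop_front();
    if (token.eot) break;
    received.push_back(token.value);
  }
  return received;
}
```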

EoT loop helper macros

TAPA provides macros that encapsulate the non-blocking EoT check pattern:

TAPA_WHILE_NOT_EOT(stream) — loops until stream delivers an EoT token; body executes only when a valid non-EoT token is available.
TAPA_WHILE_NEITHER_EOT(s1, s2) — loops until either stream delivers EoT; body executes only when both have a valid token.
TAPA_WHILE_NONE_EOT(s1, s2, s3) — three-stream variant.

void Consumer(tapa::istream<int>& in, tapa::ostream<int>& out) {
  TAPA_WHILE_NOT_EOT(in) {
    out.write(in.read(nullptr));
  }
  out.close();
}

Tip

A downstream task can reopen a closed stream with stream.open() to reuse it across multiple transactions.

Stream arrays

For parameterized designs, TAPA provides arrays of streams:

tapa::streams<T, N> — array of N streams (instantiation in upper-level task)
tapa::istreams<T, N>& / tapa::ostreams<T, N>& — directional array references in leaf task signatures

When invoking N parallel instances of a leaf task, use invoke<tag, N>(...) and TAPA distributes the array elements automatically:

void InnerStage(int b, tapa::istreams<pkt_t, kN / 2>& in_q0,
                tapa::istreams<pkt_t, kN / 2>& in_q1,
                tapa::ostreams<pkt_t, kN>& out_q) {
  tapa::task().invoke<tapa::detach, kN / 2>(Switch2x2, b, in_q0, in_q1, out_q);
}

Rules

Always pass streams by reference: istream<T>&, ostream<T>&. Never by value — the stream object is not copyable.
Each stream instance must have exactly one reader and exactly one writer.
TAPA software simulation respects stream depth: a full stream blocks the writer, matching hardware behavior.
Stream depth is a hardware FIFO size. The FPGA resource used depends on depth:
- Depth < 128: synthesised from SRL shift-registers (no BRAM cost).
- Depth ≥ 128: mapped to BRAM.
- Depth ≥ 4096 and element width ≥ 36 bits: mapped to URAM.
- Default depth is 2, which costs only SRL resources.
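
The depth thresholds above can be written as a small decision function. This is a sketch: FifoResource and ClassifyFifo are hypothetical names, and checking the URAM condition first is an assumption made because that condition overlaps with the BRAM range.

```cpp
#include <cstddef>

// Hypothetical labels for the three resource kinds listed above.
enum class FifoResource { kSrl, kBram, kUram };

// Applies the listed thresholds. The URAM rule is tested first (an
// assumption) because every FIFO that satisfies it also has depth >= 128.
FifoResource ClassifyFifo(std::size_t depth, std::size_t width_bits) {
  if (depth >= 4096 && width_bits >= 36) return FifoResource::kUram;
  if (depth >= 128) return FifoResource::kBram;
  return FifoResource::kSrl;
}
```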

Common mistakes

Wrong: stream passed by value (the & is missing; this fails to compile because streams are not copyable):

void Leaf(tapa::istream<float> in) { /* ... */ }  // missing &

Right: stream passed by reference:

void Leaf(tapa::istream<float>& in) { /* ... */ }

Memory Access: mmap

Purpose: Access FPGA-adjacent DRAM from TAPA leaf tasks using mmap.

Prerequisites: Tasks

Why this exists

FPGA designs need to read from and write to off-chip DRAM. tapa::mmap<T> provides an array-like interface that TAPA compiles to AXI4 memory-mapped transactions. It is simpler to use than async_mmap and is the right choice when latency hiding is not required or when access patterns are straightforward.

Mental model

A leaf task receives mmap<T> by value and accesses it like a C array:

void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
                 tapa::ostream<float>& stream) {
  for (uint64_t i = 0; i < n; ++i) {
    stream << mem[i];   // array subscript operator
  }
}

The upper-level task passes the mmap argument through to the leaf:

void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a");
  tapa::stream<float> b_q("b");
  tapa::stream<float> c_q("c");

  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}

Minimal correct example

Mmap2Stream from the vector-add example reads from a read-only mmap and writes the values into a stream:

void Mmap2Stream(tapa::mmap<const float> mmap, uint64_t n,
                 tapa::ostream<float>& stream) {
  for (uint64_t i = 0; i < n; ++i) {
    stream << mmap[i];
  }
}

Note that mmap is passed by value (no &).

Host-side wrappers

On the host, the direction of data transfer between host and kernel is declared in the tapa::invoke call using wrapper types:

tapa::read_only_mmap<T>(vec) — host sends data to the kernel; kernel reads
tapa::write_only_mmap<T>(vec) — kernel writes; host receives data back
tapa::read_write_mmap<T>(vec) — bidirectional transfer

Warning

read_only_mmap and write_only_mmap describe the direction of data transfer between host and kernel, not the kernel's internal access pattern. The kernel task always receives a plain mmap<T> parameter regardless of which wrapper was used.

From the vector-add host code:

tapa::invoke(
    VecAdd, FLAGS_bitstream,
    tapa::read_only_mmap<const float>(a),
    tapa::read_only_mmap<const float>(b),
    tapa::write_only_mmap<float>(c),
    n);

Aligned allocator

If the host std::vector is not page-aligned, the TAPA runtime must make an extra copy when transferring data to the FPGA. Use tapa::aligned_allocator<T> to avoid this:

std::vector<float, tapa::aligned_allocator<float>> a(n);
std::vector<float, tapa::aligned_allocator<float>> b(n);
std::vector<float, tapa::aligned_allocator<float>> c(n);

This eliminates the extra copy and suppresses XRT alignment warnings.
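
For readers curious what page alignment means in practice, the sketch below shows a minimal page-aligned allocator in standard C++ on a POSIX system. It is not tapa::aligned_allocator: PageAlignedAllocator and IsPageAligned are hypothetical names, and the 4096-byte page size is an assumption.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

constexpr std::size_t kAssumedPageSize = 4096;  // assumption, not queried

// Minimal allocator that returns page-aligned storage.
template <typename T>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

  T* allocate(std::size_t count) {
    void* ptr = nullptr;
    if (posix_memalign(&ptr, kAssumedPageSize, count * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(ptr);
  }

  void deallocate(T* ptr, std::size_t) { free(ptr); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return false;
}

// True if `ptr` starts on an (assumed) page boundary.
bool IsPageAligned(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kAssumedPageSize == 0;
}
```

In real host code, prefer tapa::aligned_allocator<T>; the sketch only illustrates the property the runtime checks for.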

Shared mmap

The same mmap argument can be passed to multiple child tasks. TAPA inserts an AXI interconnect so both tasks share the same AXI port:

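// Note: the Mmap2Stream used here takes an extra second argument (0 or 1),
// presumably an offset selecting which part of srcs each instance reads.
void Load(tapa::mmap<float> srcs, uint64_t n,
          tapa::ostream<float>& a, tapa::ostream<float>& b) {
  tapa::task()
      .invoke(Mmap2Stream, srcs, 0, n, a)
      .invoke(Mmap2Stream, srcs, 1, n, b);
}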

Warning

When a mmap is shared across tasks, the programmer is responsible for memory consistency. Concurrent accesses to the same addresses will produce undefined results.

mmap arrays

For parameterized designs with multiple independent memory ports:

tapa::mmaps<T, N> — array of N mmap interfaces (kernel side)
tapa::read_only_mmaps<T, N> / tapa::write_only_mmaps<T, N> / tapa::read_write_mmaps<T, N> — directional wrappers for tapa::invoke on the host side

// Host side
tapa::invoke(VecAdd, FLAGS_bitstream,
             tapa::read_only_mmaps<float, M>(a),
             tapa::read_only_mmaps<float, M>(b),
             tapa::write_only_mmaps<float, M>(c), n);

// Kernel side
void VecAdd(tapa::mmaps<float, M> a, tapa::mmaps<float, M> b,
            tapa::mmaps<float, M> c, uint64_t n) { /* ... */ }

Rules

Kernel task signatures: mmap<T> must be passed by value (no &). This is the opposite of streams.
mmap can only be used as a function parameter, not as a local variable.
read_only_mmap / write_only_mmap describe host-to-kernel transfer direction only; they do not constrain kernel access patterns.

Common mistakes

Wrong — mmap passed by reference:

void Kernel(tapa::mmap<float>& mem) { /* ... */ }  // & is wrong

Right — mmap passed by value:

void Kernel(tapa::mmap<float> mem) { /* ... */ }

Memory Access: async_mmap

Purpose: Use async_mmap to overlap DRAM access latency with computation.

Prerequisites: Memory Access: mmap

Why this exists

mmap does not provide explicit control over outstanding DRAM transactions. The HLS tool may issue burst transactions for sequential access, but for random-access patterns or designs that need fine-grained control over outstanding requests, the lack of explicit flow control limits throughput. Off-chip DRAM latency is typically 100–200 ns, and without the ability to overlap request issuance with data receipt, achievable bandwidth stays far below the channel peak.

async_mmap exposes the five AXI channels as individual streams, letting you issue multiple outstanding requests and overlap address issuance with data receipt. The result is much higher DRAM throughput — especially for random access — and significantly lower area overhead compared to the Vitis HLS m_axi interface.

Mental model: five AXI channels

async_mmap<T> is a struct whose fields are streams corresponding to the five AXI channels:

template <typename T>
struct async_mmap {
  using addr_t = int64_t;
  using resp_t = uint8_t;

  tapa::ostream<addr_t> read_addr;   // issue read addresses
  tapa::istream<T>      read_data;   // receive read data
  tapa::ostream<addr_t> write_addr;  // issue write addresses
  tapa::ostream<T>      write_data;  // send write data
  tapa::istream<resp_t> write_resp;  // receive write acknowledgments
};

async_mmap diagram

The key insight is that read_addr and read_data are decoupled: you can issue many addresses into read_addr before any data arrives on read_data, hiding latency by keeping multiple requests in flight simultaneously.

Minimal correct example

The pattern for overlapping read requests and responses in a single pipelined loop:

void ReadKernel(tapa::async_mmap<float>& mem, float* result,
                uint64_t n) {
  for (uint64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
    // Issue a read address if the channel has space
    if (i_req < n && mem.read_addr.try_write(i_req)) {
      ++i_req;
    }
    // Consume a read response if data is available
    if (!mem.read_data.empty()) {
      result[i_resp] = mem.read_data.read(nullptr);
      ++i_resp;
    }
  }
}

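Two loop counters track progress: i_req counts the addresses issued and i_resp counts the responses consumed, so i_req - i_resp is the number of requests in flight. Because both checks are non-blocking, the loop can issue a new address and receive a response in the same clock cycle.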
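The benefit of keeping several requests in flight can be seen in a small software model. The sketch below does not use TAPA; it assumes a memory that returns each request exactly `latency` cycles after it is issued, and it caps the number of outstanding requests at `max_outstanding`. ReadStats and SimulateReads are hypothetical names, and the cycle counts describe this toy model only, not real hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct ReadStats {
  std::vector<float> result;       // data in response order
  uint64_t cycles = 0;             // loop iterations until all data arrived
  std::size_t peak_in_flight = 0;  // most requests pending at a cycle's end
};

// Models the two-counter loop: one address may be issued and one response
// consumed per cycle. A request issued at cycle t is ready at t + latency.
ReadStats SimulateReads(const std::vector<float>& dram, uint64_t latency,
                        std::size_t max_outstanding) {
  assert(max_outstanding > 0);
  const uint64_t n = dram.size();
  ReadStats stats;
  stats.result.resize(n);
  std::queue<std::pair<uint64_t, uint64_t>> in_flight;  // (ready cycle, addr)
  uint64_t i_req = 0, i_resp = 0;
  while (i_resp < n) {
    // Issue side: succeeds only while the request channel has space.
    if (i_req < n && in_flight.size() < max_outstanding) {
      in_flight.push({stats.cycles + latency, i_req});
      ++i_req;
    }
    // Response side: consume the oldest request once its data is ready.
    if (!in_flight.empty() && in_flight.front().first <= stats.cycles) {
      stats.result[i_resp] = dram[in_flight.front().second];
      in_flight.pop();
      ++i_resp;
    }
    stats.peak_in_flight = std::max(stats.peak_in_flight, in_flight.size());
    ++stats.cycles;
  }
  return stats;
}
```

In this model, reading 100 elements with a latency of 10 cycles takes 110 cycles when up to 64 requests may be outstanding, and 1100 cycles when only one request is allowed at a time. The second case is what a loop that waits for each response before issuing the next address degenerates to.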

Runtime burst detection

TAPA coalesces sequential addresses into AXI bursts automatically at runtime. You only need to issue individual element-by-element addresses; TAPA's generated hardware merges adjacent requests into larger burst transactions dynamically. This provides burst efficiency for sequential patterns without requiring static analysis or explicit burst programming in your kernel code.
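The idea behind coalescing can be sketched in software. The function below merges a stream of element addresses into (base, length) bursts whenever the next address directly follows the current burst. It illustrates the concept only and is not TAPA's generated hardware, which works on a live address stream and is not described in detail here. Burst and CoalesceBursts are hypothetical names, and the default cap of 256 beats is the AXI4 limit for incrementing bursts, used here as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One merged transaction: `length` consecutive elements starting at `base`.
struct Burst {
  uint64_t base;
  uint32_t length;
};

// Greedily extends the current burst while addresses stay consecutive and
// the burst is shorter than max_len; otherwise starts a new burst.
std::vector<Burst> CoalesceBursts(const std::vector<uint64_t>& addrs,
                                  uint32_t max_len = 256) {
  std::vector<Burst> bursts;
  for (uint64_t addr : addrs) {
    if (!bursts.empty() && bursts.back().length < max_len &&
        addr == bursts.back().base + bursts.back().length) {
      ++bursts.back().length;
    } else {
      bursts.push_back({addr, 1});
    }
  }
  return bursts;
}
```

For the address sequence 0, 1, 2, 3, 10, 11, 5 this yields three bursts: four elements from 0, two elements from 10, and a single element at 5. A fully random sequence yields one burst per address, which is why the benefit applies to sequential patterns.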

Area comparison

async_mmap uses far fewer flip-flops and no BRAM compared with the Vitis HLS m_axi interface, which matters on HBM devices that expose many memory channels:

Memory Interface	Clock (MHz)	LUT	FF	BRAM	URAM	DSP
`#pragma HLS interface m_axi`	300	1189	3740	15	0	0
`async_mmap`	300	1466	162	0	0	0

async_mmap uses no BRAM and drastically fewer flip-flops, at the cost of slightly more LUTs for the burst-detection logic.

Rules

async_mmap<T> must be passed by reference (async_mmap<T>&). Passing by value is deprecated.
Channel operations (try_read/try_write on the five streams) are leaf-task only. An upper-level task may accept and forward an async_mmap<T>& parameter to a child leaf task without operating on it.
An mmap<T> argument can be passed to an async_mmap<T>& parameter — mmap is automatically promoted.
Only non-blocking operations (try_read, try_write) should be used on async_mmap channels inside pipelined loops.

Warning

Never use blocking read/write on async_mmap channels inside a pipelined loop. A blocking call stalls the whole loop iteration, so the other channels cannot make progress and the kernel can deadlock. Always use try_read and try_write.

Common mistakes

Wrong — async_mmap passed by value (deprecated):

void Kernel(tapa::async_mmap<float> mem) { /* ... */ }  // missing &

Right — async_mmap passed by reference:

void Kernel(tapa::async_mmap<float>& mem) { /* ... */ }

Wrong — blocking read inside a pipelined loop:

// Wrong: blocks until data arrives, preventing address issuance
float val = mem.read_data.read();

Right — non-blocking read with availability check:

float val;
if (mem.read_data.try_read(val)) {
  // process val
}

Software Simulation

Purpose: Run software simulation to verify your TAPA design's logic without FPGA hardware.

When to use this: Before synthesizing — software simulation is fast (seconds) and requires only a C++ compiler and the TAPA library.

What you need

A compiled TAPA host executable (produced by tapa g++)
No FPGA, no Vivado, no XRT required

Commands

Run the executable with no --bitstream argument. TAPA detects the missing argument and runs the software simulation:

./vadd

For reproducible output when debugging ordering-sensitive behavior, pin the simulation to a single thread:

TAPA_CONCURRENCY=1 ./vadd

Note

TAPA_CONCURRENCY defaults to the physical CPU core count. Set it to 1 for reproducible task scheduling at the cost of simulation speed.

Expected output

I20000101 00:00:00.000000 0000000 task.h:66] running software simulation with TAPA library
kernel time: 1.19429 s
PASS!

The log line confirms the software simulation path was taken. PASS! is printed by the application when its correctness check succeeds.

Stream logging

To capture the values flowing through every tapa::stream channel, set TAPA_STREAM_LOG_DIR before running:

TAPA_STREAM_LOG_DIR=/tmp/logs ./vadd

TAPA writes one log file per stream. The format depends on the element type:

Primitive types (int, float, …) are logged as human-readable text, one value per line. For example, writing 42 to a tapa::stream<int> produces 42\n.
Non-primitive types without operator<< are logged in hex with little-endian byte order. For example, writing Foo{0x4222} to a tapa::stream<Foo> produces 0x22420000\n.
Non-primitive types with operator<< defined are logged using that operator, producing human-readable text.
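The hex format for types without operator<< can be reproduced with a short helper, which may be handy when writing a script that checks log files. This is a sketch under assumptions: HexInMemoryOrder is a hypothetical name, Foo is assumed to hold a single 32-bit field, lowercase digits are assumed, and the helper relies on a little-endian host so that memory order equals little-endian order. It is not the logging code TAPA uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Assumed layout for the Foo used in the example above.
struct Foo {
  uint32_t value;
};

// Formats the raw bytes of `elem` in memory order as "0x" plus two hex
// digits per byte. On a little-endian host this is little-endian order.
template <typename T>
std::string HexInMemoryOrder(const T& elem) {
  static_assert(std::is_trivially_copyable<T>::value,
                "raw byte dump requires a trivially copyable type");
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &elem, sizeof(T));
  static const char kDigits[] = "0123456789abcdef";
  std::string out = "0x";
  for (unsigned char b : bytes) {
    out += kDigits[b >> 4];
    out += kDigits[b & 0xf];
  }
  return out;
}
```

With this helper, HexInMemoryOrder(Foo{0x4222}) gives the string 0x22420000 on a little-endian machine: the least significant byte 0x22 comes first, followed by 0x42 and two zero bytes.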

Why coroutine simulation is more accurate than Vitis HLS simulation

Vitis HLS software simulation runs each task sequentially in a single thread. The tasks take turns executing to completion before the next one starts. This means races between concurrent tasks are invisible — the simulation passes even when tasks make assumptions about each other's execution order that will not hold in real hardware.

TAPA uses coroutine-based simulation: tasks run as coroutines that yield cooperatively at stream blocking points, scheduled on up to TAPA_CONCURRENCY worker threads. When a task calls read() on an empty stream, it suspends and another task runs. This models the concurrent, backpressure-driven semantics of hardware much more faithfully. Bugs that manifest in hardware because two tasks execute simultaneously are far more likely to surface during TAPA software simulation than during Vitis HLS software simulation.

This is also why TAPA enforces stream depth in software simulation: a producer that fills a depth-2 FIFO will block in TAPA simulation, just as it would in hardware.
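The effect of an enforced depth can be shown with a small bounded queue. The class below is a plain C++ model rather than tapa::stream; BoundedFifo and its members are hypothetical names chosen to mirror the try_read/try_write vocabulary used in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Fixed-capacity FIFO: writes fail once `depth` elements are queued.
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t depth) : depth_(depth) {}

  // Returns false (and stores nothing) when the FIFO is full.
  bool try_write(const T& value) {
    if (queue_.size() >= depth_) return false;
    queue_.push_back(value);
    return true;
  }

  // Returns false (and leaves `value` untouched) when the FIFO is empty.
  bool try_read(T& value) {
    if (queue_.empty()) return false;
    value = queue_.front();
    queue_.pop_front();
    return true;
  }

  bool full() const { return queue_.size() >= depth_; }
  bool empty() const { return queue_.empty(); }

 private:
  std::size_t depth_;
  std::deque<T> queue_;
};
```

With a depth of 2, the first two writes succeed and the third fails until a consumer reads one element. A blocking write in the same situation would suspend the producer, which is the behavior that exposes depth-related deadlocks before synthesis.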

Debugging with GDB

Software simulation runs as ordinary host code, so GDB works as normal:

gdb ./vadd

Then set a breakpoint on any TAPA task function by name:

(gdb) b VecAdd
(gdb) run

Breakpoints, watchpoints, and backtraces all work because every task runs as a coroutine on the host CPU.

Validation

Simulation is correct when:

The program exits with code 0.
The application's own correctness check prints PASS! (or your application's equivalent).
No deadlock or hang occurs within the expected runtime.

If something goes wrong

Warning

If the simulation hangs indefinitely, a stream deadlock is likely. See Deadlocks & Hangs for diagnosis steps.

For unexpected errors or assertion failures, see Common Errors.

Next step: Fast Hardware Simulation

Fast Hardware Simulation

Purpose: Validate RTL correctness faster than Vitis cosimulation using TAPA's fast cosim.

When to use this: After tapa compile produces a .xo file, before the multi-hour v++ --link step. Fast cosim catches logic bugs in generated RTL in seconds rather than the ten-plus minutes Vitis cosimulation requires.

What you need

A .xo kernel object from tapa compile (or a .zip for the xilinx-hls target)
One of:
- xsim: Requires a Vivado installation. Linux only.
- verilator: Open-source. Works on Linux and macOS. No Vivado required.

Commands

Basic run

Pass the .xo file as the --bitstream argument:

./vadd --bitstream VecAdd.xo 1000

For the xilinx-hls target, a .zip file also works:

./vadd --bitstream VecAdd.zip 1000

Choosing a simulator backend

The default backend is xsim. To switch to Verilator:

./vadd --bitstream VecAdd.xo -cosim_simulator verilator 1000

Saving waveforms

Specify a persistent work directory and enable waveform saving:

./vadd --bitstream VecAdd.xo \
    -cosim_work_dir ./cosim_work \
    -xsim_save_waveform \
    1000

Warning

Strongly recommended: pair -xsim_save_waveform with -cosim_work_dir. Without a persistent work directory, fast cosim uses a temporary directory that is deleted at exit, removing any saved waveforms with it.

Setup-only and resume workflow

When you want to inspect the generated simulation environment before committing to a full run:

# Step 1: set up the simulation environment and stop before running
./vadd --bitstream VecAdd.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_setup_only \
    1000

# Step 2: after inspecting, run post-simulation checks without re-simulating
./vadd --bitstream VecAdd.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_resume_from_post_sim \
    1000

Parallel runs

When a host application calls tapa::invoke more than once — for example, a pipeline split into separate kernels each compiled to its own .xo file — TAPA launches all cosim instances concurrently. Each kernel is compiled independently and its .xo path is passed to its own tapa::invoke call via a separate bitstream flag:

// Host code: two separate kernels, each with its own bitstream flag
DEFINE_string(producer_bitstream, "", "XO for Producer kernel");
DEFINE_string(consumer_bitstream, "", "XO for Consumer kernel");

tapa::invoke(Producer, FLAGS_producer_bitstream, ...);
tapa::invoke(Consumer, FLAGS_consumer_bitstream, ...);

./app --producer_bitstream=producer.xo --consumer_bitstream=consumer.xo

If all instances share the same -cosim_work_dir, their simulation environments collide. Pass -cosim_work_dir_parallel to give each instance its own uniquely named subdirectory:

./app \
    --producer_bitstream=producer.xo \
    --consumer_bitstream=consumer.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_work_dir_parallel

TAPA creates ./cosim_work/XXXXXX/ (a unique name per instance) so that the simulations run without interfering with each other's build artifacts.

Runtime flags reference

The following flags control fast cosim behavior when passed to the host executable. The canonical reference is Runtime Flags.

Flag	Description
`-cosim_executable <path>`	Deprecated. Fast cosim now runs in-process via `libfrt`; this flag is ignored.
`-xsim_part_num <part>`	Target FPGA part number for simulation (e.g., `xcu280-fsvh2892-2L-e`).
`-cosim_work_dir <dir>`	Persistent working directory for simulation artifacts. Without this, a temporary directory is used and deleted after the run.
`-xsim_save_waveform`	Save simulation waveforms to a `.wdb` file in the work directory. Requires `-cosim_work_dir`.
`-xsim_start_gui`	Open the Vivado GUI for interactive debugging during simulation.
`-cosim_simulator <backend>`	Simulator backend: `xsim` (default, Linux only) or `verilator` (cross-platform).
`-cosim_setup_only`	Run simulation setup only, then stop before executing the simulation.
`-cosim_resume_from_post_sim`	Skip re-running the simulation; jump directly to post-simulation checks.
`-cosim_work_dir_parallel`	Create a unique subdirectory per instance when running concurrent simulations.

Expected output

Fast cosim completes in seconds for simple designs. A successful run prints the application's correctness result (e.g., PASS!) after the simulation finishes.

Debugging frozen simulations

If the simulation becomes unresponsive:

Run with -cosim_work_dir to persist intermediate files.
Abort the simulation with Ctrl-C.
Locate [work-dir]/output/run/run_cosim.tcl.

Open Vivado in GUI mode and source the script:

vivado -mode gui -source [work-dir]/output/run/run_cosim.tcl

This allows real-time observation and waveform analysis of the frozen state.

Warning

Cross-channel access for HBM is not currently supported in fast cosimulation. Each AXI interface can only access one HBM channel.

If something goes wrong

Warning

See Cosimulation Issues for diagnosis steps covering xsim hangs, Verilator build errors, and waveform debugging.

Next step: Vitis Cosimulation

Parallel RTL Emulation

Purpose: Run cycle-accurate RTL simulation for each kernel module concurrently, reducing total cosim time while preserving cycle-accurate behavior where it matters.

RTL cosimulation gives you cycle-accurate behavior for the logic inside each kernel — pipeline depths, stall conditions, II violations, and hazards that software simulation cannot catch. It does not give you cycle-accurate behavior between kernels: the FIFOs connecting separate cosim processes are shared-memory queues, and memory (mmap/async_mmap) latency is similarly abstracted. Parallel RTL emulation is therefore most valuable for validating the cycle-sensitive internals of individual kernels, not end-to-end timing across the full datapath.

Running one cosim process per kernel and launching them concurrently reduces wall-clock time compared to simulating everything in a single process or sequentially.

Concept

In a standard TAPA design, one top-level function is compiled into one .xo and the entire design is simulated as a single cosim process. In the parallel emulation pattern:

Each kernel function is compiled to its own .xo with tapa compile --top <KernelFunc>.
The host application defines a separate bitstream flag per kernel and passes each to .invoke() wrapped in tapa::executable.
tapa::task launches all kernel simulations concurrently; streams between kernels communicate through shared memory files managed by the runtime.

┌────────────────────────────────────────────────────────┐
│  Host application                                      │
│                                                        │
│  tapa::task()                                          │
│    .invoke(KernelA, tapa::executable(FLAGS_a_bs), ...) │──▶ cosim process A
│    .invoke(KernelB, tapa::executable(FLAGS_b_bs), ...) │──▶ cosim process B
│    .invoke(KernelC, tapa::executable(FLAGS_c_bs), ...) │──▶ cosim process C
└────────────────────────────────────────────────────────┘
         streams between kernels → shared-memory FIFOs (not cycle-accurate)

API

`tapa::executable`

Wraps a path to a kernel .xo (or .zip for the xilinx-hls target). When passed as the second argument to .invoke(), the runtime launches RTL emulation for that invocation instead of running it in software simulation.

class executable {
 public:
  explicit executable(std::string path);
  // Not copyable or movable.
};

If the path is empty, .invoke() falls back to software simulation for that kernel. This lets a single binary select simulation or emulation per-kernel at runtime.

`tapa::task::invoke` with `tapa::executable`

// Kernel-specific override: run KernelFunc from the given XO file.
task& invoke(Func&& func, tapa::executable exe, Args&&... args);

All .invoke() calls in a tapa::task() chain start concurrently. Kernels that receive a tapa::executable each get their own cosim process; kernels without one run as software coroutines.

Note

tapa::executable must be provided before any argument that is a direct stream reader or writer. The runtime uses the executable path to bind the right simulation backend before it can connect streams.

Compiling Each Kernel

Each kernel function is compiled independently. Invoke tapa compile once per top function, passing its name via --top:

tapa compile \
  --top Scatter \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33 \
  -f cannon.cpp \
  -o scatter.xo

tapa compile \
  --top ProcElem \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33 \
  -f cannon.cpp \
  -o proc-elem.xo

tapa compile \
  --top Gather \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33 \
  -f cannon.cpp \
  -o gather.xo

All three compilations can share the same source file. Each produces an independent .xo that knows only its own top function's interface.

Host Code

The host application follows the standard TAPA pattern, but uses one DEFINE_string per kernel rather than a single --bitstream flag:

#include <gflags/gflags.h>
#include <tapa.h>

DEFINE_string(scatter_bitstream, "",
              "path to Scatter XO; empty = software simulation");
DEFINE_string(proc_elem_bitstream, "",
              "path to ProcElem XO; empty = software simulation");
DEFINE_string(gather_bitstream, "",
              "path to Gather XO; empty = software simulation");

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  // ... allocate buffers ...

  tapa::invoke(TopFunction, /*bitstream=*/"",
               tapa::read_only_mmap<const float>(a),
               tapa::read_only_mmap<const float>(b),
               tapa::write_only_mmap<float>(c), n);
}

The TopFunction assembles the task graph. Each .invoke() receives its own tapa::executable:

void TopFunction(tapa::mmap<const float> a_vec,
                 tapa::mmap<const float> b_vec,
                 tapa::mmap<float> c_vec, uint64_t n) {
  tapa::streams<float, 4> a("a");
  tapa::streams<float, 4> b("b");
  tapa::streams<float, 4> c("c");
  // ... declare inter-kernel streams ...

  tapa::task()
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), a_vec, a)
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), b_vec, b)
      .invoke(ProcElem, tapa::executable(FLAGS_proc_elem_bitstream), a, b, c, ...)
      // ... more ProcElem instances ...
      .invoke(Gather, tapa::executable(FLAGS_gather_bitstream), c_vec, c);
}

Streams declared inside TopFunction are host-side objects. The runtime passes references to the same shared-memory FIFO to each cosim process that reads or writes it, so data flows between kernels in the same order as on hardware, although the timing of transfers between kernels is not cycle-accurate.

Running

Pass the compiled .xo files to the host binary:

./cannon \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo

When any flag is empty the corresponding kernel runs in software simulation. This lets you emulate a subset of the design while the rest runs in simulation:

# Only emulate ProcElem; Scatter and Gather run in software simulation.
./cannon --proc_elem_bitstream=proc-elem.xo

Work directory

By default each cosim process writes to a temporary directory that is deleted at exit. Provide -cosim_work_dir to retain artifacts. When multiple kernels share the same work directory their simulation environments collide; use -cosim_work_dir_parallel to give each process a unique subdirectory:

./cannon \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_work_dir_parallel

TAPA creates ./cosim_work/XXXXXX/ (a unique name per instance) so the simulations do not interfere with each other.

Simulator backend

The same -cosim_simulator flag applies to all instances:

./cannon \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo \
    -cosim_simulator verilator

Controlling concurrency

Set TAPA_CONCURRENCY to limit how many cosim processes run simultaneously. This is useful on machines with limited memory:

TAPA_CONCURRENCY=1 ./cannon \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo

At TAPA_CONCURRENCY=1 the processes still exchange data correctly through shared-memory FIFOs, but only one simulation runs at a time.

Runtime flags reference

Flag	Description
`-cosim_work_dir <dir>`	Persistent working directory for simulation artifacts.
`-cosim_work_dir_parallel`	Create a unique subdirectory per instance. Required when multiple kernels share `-cosim_work_dir`.
`-cosim_simulator <backend>`	`xsim` (default, Linux only) or `verilator` (cross-platform). Applied to all instances.
`-xsim_save_waveform`	Save simulation waveforms. Pair with `-cosim_work_dir`.
`-cosim_executable <path>`	Deprecated. Fast cosim now runs in-process via `libfrt`; this flag is ignored.
`-xsim_part_num <part>`	Target FPGA part number (e.g., `xcu280-fsvh2892-2L-e`).
`TAPA_CONCURRENCY`	Environment variable. Limits the number of cosim processes that run simultaneously.

Full example: Cannon matrix multiply

The tests/functional/parallel-emulation/ directory in the TAPA repository contains a working parallel-emulation example. The Cannon algorithm splits into three kernels:

Kernel	Role
`Scatter` (×2)	Distributes rows of matrices A and B into per-PE stream arrays
`ProcElem` (×p²)	Each PE computes its sub-matrix tile and shifts blocks to neighbours
`Gather` (×1)	Collects results from all PEs into the output matrix

Compile (three invocations from one source file):

tapa compile --top Scatter  -f cannon.cpp -o scatter.xo   --part-num xcu280-fsvh2892-2L-e --clock-period 3.33
tapa compile --top ProcElem -f cannon.cpp -o proc-elem.xo --part-num xcu280-fsvh2892-2L-e --clock-period 3.33
tapa compile --top Gather   -f cannon.cpp -o gather.xo    --part-num xcu280-fsvh2892-2L-e --clock-period 3.33

Run:

./cannon-host \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_work_dir_parallel

A successful run prints PASS! after all simulation processes finish.

See also: Fast Hardware Simulation — single-kernel cosim with the same -cosim_* and -xsim_* flags.

Vitis Cosimulation

Purpose: Run full Vitis hardware emulation for more representative timing estimates after fast cosim passes.

When to use this: When you need timing or bandwidth estimates that fast cosim cannot provide. This step is slow (5–10 minutes for simple designs) and is rarely the first choice; run Fast Hardware Simulation first to catch logic errors.

What you need

A .xo kernel object from tapa compile
Vitis and XRT installed (Linux only)
The target platform string (e.g., xilinx_u280_xdma_201920_3)

Commands

Generate the hardware emulation bitstream

platform=xilinx_u280_xdma_201920_3

v++ -o vadd.$platform.hw_emu.xclbin \
  --link \
  --target hw_emu \
  --kernel VecAdd \
  --platform $platform \
  vadd.$platform.hw.xo

Replace $platform with your actual target platform string and VecAdd with your top-level kernel name. This step typically takes 5–10 minutes.

Run the hardware emulation

./vadd --bitstream=vadd.$platform.hw_emu.xclbin 1000

The same host executable used for software simulation and fast cosim runs unchanged here — only the --bitstream argument changes.
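The convention described across these guides, where the --bitstream value alone decides how the design runs, can be summarized in a few lines. The sketch below only mirrors that convention as documented here; RunMode, EndsWith, and ClassifyBitstream are hypothetical names, and this is not how the TAPA runtime is implemented.

```cpp
#include <cassert>
#include <string>

enum class RunMode {
  kSoftwareSimulation,  // no bitstream given
  kFastCosim,           // .xo, or .zip for the xilinx-hls target
  kXclbin,              // .xclbin: hardware emulation or on-board run
  kUnknown,
};

// True when `s` ends with `suffix`.
bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Maps a --bitstream value to the run mode described in this guide.
RunMode ClassifyBitstream(const std::string& path) {
  if (path.empty()) return RunMode::kSoftwareSimulation;
  if (EndsWith(path, ".xo") || EndsWith(path, ".zip")) {
    return RunMode::kFastCosim;
  }
  if (EndsWith(path, ".xclbin")) return RunMode::kXclbin;
  return RunMode::kUnknown;
}
```

Whether an .xclbin runs in emulation or on the board depends on the target it was linked for (hw_emu or hw), so the file name alone does not distinguish the two.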

Expected output

INFO: Loading vadd.xilinx_u280_xdma_201920_3.hw_emu.xclbin
INFO: Found platform: Xilinx
INFO: Found device: xilinx_u280_xdma_201920_3
INFO: Using xilinx_u280_xdma_201920_3
INFO: [HW-EMU 01] Hardware emulation runs simulation underneath. Using a large data set will result in long simulation times. It is recommended that a small dataset is used for faster execution. The flow uses approximate models for DDR memory and interconnect and hence the performance data generated is approximate.
...
INFO: [HW-EMU 06-0] Waiting for the simulator process to exit
INFO: [HW-EMU 06-1] All the simulator processes exited successfully
elapsed time: 31.0901 s
PASS!

Note

Vitis hardware emulation uses approximate models for DDR memory and interconnects. Performance numbers from hw_emu are indicative, not exact. For precise measurements, run on an actual board using an hw bitstream.

Validation

The run is correct when:

The INFO: [HW-EMU 06-1] All the simulator processes exited successfully line appears.
The application's correctness check prints PASS!.
The elapsed time is reported (confirming the kernel actually executed).

Tip

Use a small dataset for hardware emulation runs. Large datasets cause proportionally long simulation times because every clock cycle is simulated in software.

If something goes wrong

Warning

See Cosimulation Issues for diagnosis steps. Common issues include missing XRT environment variables, platform string mismatches, and kernel name mismatches between the --kernel flag and the TAPA top-level function name.

Next step: Build & Run on Board

Build & Run on Board

Purpose: Build a TAPA design into an FPGA bitstream and run it on an Alveo board.

When to use this: After fast cosim (and optionally Vitis cosim) passes — this step converts your .xo kernel object into a hardware bitstream and executes it on real silicon.

What you need

A .xo kernel object from tapa compile
Vitis and XRT installed (Linux only)
The target platform string (e.g., xilinx_u280_xdma_201920_3)
An Alveo board installed in the system for the final execution step
Several hours of compute time for v++ --link

Stage 1: Compile the kernel with TAPA

If you do not already have a .xo, produce it with tapa compile:

platform=xilinx_u280_xdma_201920_3

tapa \
  --work-dir work.out \
  compile \
  --top VecAdd \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vadd.$platform.hw.xo

The .xo file is the artifact that feeds v++.

Stage 2: Link into an FPGA bitstream

v++ -o vadd.$platform.hw.xclbin \
  --link \
  --target hw \
  --kernel VecAdd \
  --platform $platform \
  vadd.$platform.hw.xo

Warning

This step takes several hours depending on design complexity and host machine performance. Plan accordingly and consider running it on a dedicated build server (see Remote Execution).

The output artifact is vadd.$platform.hw.xclbin — this is the bitstream loaded onto the FPGA.

Key alignment rules:

--kernel VecAdd must match the top-level function name in your TAPA source.
--platform $platform must name a platform built on the FPGA part passed to tapa compile --part-num (for example, xilinx_u280_xdma_201920_3 pairs with xcu280-fsvh2892-2L-e).
The input .xo filename (vadd.$platform.hw.xo) must be the file produced by tapa compile.

Stage 3: Execute on the FPGA

The same host executable used for software and hardware simulation runs on board:

./vadd --bitstream=vadd.$platform.hw.xclbin

Expected output

INFO: Found platform: Xilinx
INFO: Found device: xilinx_u280_xdma_201920_3
INFO: Using xilinx_u280_xdma_201920_3
...
elapsed time: 7.48926 s
PASS!

On-board execution is substantially faster than hardware emulation. The elapsed time includes FPGA reconfiguration time (loading the bitstream).

Validation

The run is correct when:

XRT finds and selects the expected device.
The elapsed time is reported.
The application's correctness check prints PASS!.

Tip

If you use std::vector for memory-mapped buffers, XRT may warn about unaligned host pointers, which causes an extra memory copy. To eliminate the copy, use std::vector<T, tapa::aligned_allocator<T>> instead.

If something goes wrong

Warning

See Common Errors for diagnosis steps. Common issues include XRT not finding the device, platform string mismatches, and bitstream generated for a different platform than the installed board.

Next step: Remote Execution

Remote Execution

Purpose: Offload TAPA vendor-tool steps to a remote Linux machine over SSH.

When to use this: When your development machine is macOS (where Xilinx/AMD tools are unavailable) or when you want to delegate long-running HLS synthesis and implementation steps to a dedicated Linux build server.

What you need

SSH access to a Linux machine with Vitis HLS and/or Vivado installed
The path to settings64.sh on the remote machine
TAPA installed locally (the tapa analyze step always runs locally)

How remote execution works

TAPA splits work between local and remote:

Step	Runs where
`tapa analyze` (runs `tapa-cpp` and `tapacc`)	Always local
`tapa synth` (Vitis HLS synthesis)	Remote when `--remote-host` is set
`tapa pack` (IP packaging)	Remote when `--remote-host` is set
Host fast-cosim runtime (`--bitstream=*.xo`)	Remote when `--remote-host` is set
File transfer (`.xo`, `.zip` artifacts)	Handled automatically by TAPA

Commands

Inline remote flags

tapa \
  --work-dir work.out \
  --remote-host alice@build-server.example.com:22 \
  --remote-key-file ~/.ssh/id_ed25519 \
  --remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
  compile \
  --top VecAdd \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vadd.xo

Parallel HLS jobs on the remote host

Use -j to run up to N Vitis HLS processes in parallel on the remote machine:

tapa \
  --work-dir work.out \
  --remote-host alice@build-server.example.com \
  --remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
  synth \
  -j 8 \
  ...

Note

TAPA_CONCURRENCY and -j are different controls:

TAPA_CONCURRENCY controls the number of parallel software-simulation threads used by the host runtime during functional simulation (tapa::invoke with no bitstream). It has no effect on HLS or remote execution.
-j (passed to tapa synth) controls how many Vitis HLS processes run in parallel on the remote host.

Keep -j at or below the number of cores available on the remote machine.

Reusing the SSH connection

To avoid establishing a new TCP connection on every tapa invocation, use connection multiplexing with a persistent socket directory:

tapa \
  --work-dir work.out \
  --remote-host alice@build-server.example.com \
  --remote-ssh-control-dir ~/.ssh/tapa-mux \
  --remote-ssh-control-persist 4h \
  --remote-xilinx-settings /opt/Xilinx/Vitis/2024.1/settings64.sh \
  compile \
  ...

The master connection stays alive for 4 hours after the last client closes. Subsequent tapa invocations within that window reuse the existing TCP connection.

Remote flags reference

Flag	Description
`--remote-host user@host[:port]`	Remote Linux host for vendor tools. Omit user to use the current local username; omit port to use 22.
`--remote-key-file PATH`	SSH private key for authentication. Defaults to the SSH agent or `~/.ssh/id_rsa`.
`--remote-xilinx-settings PATH`	Path to `settings64.sh` on the remote host. TAPA sources this before invoking Vitis HLS.
`--remote-ssh-control-dir DIR`	Local directory for OpenSSH multiplex control sockets. Share across invocations to reuse the master connection.
`--remote-ssh-control-persist DURATION`	How long the master socket stays alive after the last connection closes (e.g., `30m`, `4h`). Default: `30m`.
`--remote-disable-ssh-mux`	Disable SSH connection multiplexing. Each SSH/SCP call opens a fresh connection. Use this when the remote host or a proxy does not support `ControlMaster`.

Persistent configuration via `~/.taparc`

Instead of repeating remote flags on every invocation, store them in ~/.taparc:

remote:
  host: build-server.example.com
  user: alice
  port: 22
  key_file: ~/.ssh/id_ed25519
  xilinx_settings: /opt/Xilinx/Vitis/2024.1/settings64.sh
  work_dir: /tmp/tapa-remote
  ssh_control_dir: ~/.ssh/tapa-mux
  ssh_control_persist: 4h
  ssh_multiplex: true

CLI flags always override the corresponding ~/.taparc values. In particular, --remote-host replaces the host, user, and port fields from the config file.
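The defaulting rules for the user@host[:port] form can be sketched as a small parser. This is an illustration of the rules stated in the flags table, not TAPA's own parsing code; RemoteTarget and ParseRemoteHost are hypothetical names, the local username is passed in by the caller, and bracketed IPv6 literals are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct RemoteTarget {
  std::string user;
  std::string host;
  int port;
};

// Parses "user@host:port", where "user@" and ":port" are both optional.
// A missing user falls back to local_user; a missing port falls back to 22.
RemoteTarget ParseRemoteHost(const std::string& spec,
                             const std::string& local_user) {
  RemoteTarget target{local_user, "", 22};
  std::string rest = spec;
  const std::size_t at = spec.find('@');
  if (at != std::string::npos) {
    target.user = spec.substr(0, at);
    rest = spec.substr(at + 1);
  }
  const std::size_t colon = rest.rfind(':');
  if (colon != std::string::npos) {
    target.port = std::stoi(rest.substr(colon + 1));
    rest = rest.substr(0, colon);
  }
  target.host = rest;
  return target;
}
```

Note that the function takes no configuration-file input: because --remote-host replaces all three fields, a port stored in ~/.taparc plays no part once the flag is given, and an omitted port resolves to 22.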

Validation

After a successful remote compile, the .xo artifact is automatically transferred back to your local machine. Check for it:

ls -lh vadd.xo

TAPA prints transfer progress and the remote Vitis HLS log to standard output during the run.

If something goes wrong

Warning

SSH connection refused or timeout: Verify the host, port, and that your key is accepted with ssh -i ~/.ssh/id_ed25519 alice@build-server.example.com.

settings64.sh not found: Confirm the path is correct on the remote machine with ssh alice@build-server.example.com ls /opt/Xilinx/Vitis/2024.1/settings64.sh.

ControlMaster errors: If the remote host or an intermediary proxy does not support SSH multiplexing, add --remote-disable-ssh-mux to your invocation.

Port conflicts with ~/.taparc: If you omit the port in --remote-host, TAPA defaults to port 22 — it does not fall back to the port field from ~/.taparc. Always include the port explicitly (e.g., user@host:2222) when the remote host listens on a non-standard port.

Next step: Using the Visualizer

Using the Visualizer

Purpose: Inspect your TAPA design's task graph and dataflow using the visualizer.

When to use this: When you want to understand the task hierarchy and stream connections in your design, trace data flows between tasks, or navigate complex hierarchical designs.

What you need

A graph.json file generated by tapa compile (found in the work directory under work.out/)
A modern web browser (Chrome, Edge, Firefox, or other Chromium/Firefox-based browser)
The TAPA Visualizer web app — build it from the tapa-visualizer/ directory in the TAPA repository

Commands

1. Run tapa compile with a --work-dir to produce graph.json:
```
tapa --work-dir work.out compile --top VecAdd ...
```
2. Open the TAPA Visualizer in your browser.
3. Click the Choose File input in the top-left corner and select work.out/graph.json.

The graph loads and renders automatically after file selection.

TAPA Visualizer showing the task graph of a design

Interface components

The toolbar provides controls for working with the graph:

File controls:

Choose File — select a graph.json file to load.
Clear Graph — remove the current graph from the view.

Sub-task display modes — three modes control how task instances are shown:

Mode	Description
Merge Sub-task	One node per task type; all instances merged into a single node.
Separate Sub-task	One node per instance, named `taskname/0`, `taskname/1`, with connections named `connection/0`, `connection/1`, etc.
Expand Sub-task	One node per actual sub-task instance, each with its own sibling tree rather than being merged.

The three sub-task display modes side by side

The image above shows (left to right) Merge, Separate, and Expand modes. Notice the Load combo in the top-left: Mmap2Stream has 2 sub-tasks, which appear differently in each mode.

Action buttons:

Rerender Graph — re-lays out the graph and fits it to the view. Useful for large graphs or when using progressive layout algorithms like ForceAtlas2.
Fit Center — centers the graph in the view.
Fit View — centers and resizes the graph to fit the current viewport.
Save Image — exports the current graph as an image file.
Toggle Sidebar — shows or hides the information sidebar.

Tip

Hover over any toolbar button to see a tooltip with its name and function.

Interactive graph

The graph represents your TAPA design as a hierarchical, directed graph:

Nodes represent tasks. Color indicates connectivity: nodes with only incoming or only outgoing connections appear in lighter colors; nodes with both appear darker.
Edges represent connections (typically FIFO streams) between tasks.
Combos (rectangular container areas) represent upper-level tasks containing nested tasks.

Supported interactions:

Interaction	Effect
Click an element	Displays its details in the sidebar.
Drag a node	Repositions the node.
Double-click a combo	Expands or collapses its contents.
Drag the background	Pans the view.
Shift+drag	Box selection.
Ctrl+drag	Lasso selection.

Box selection and lasso selection in the graph view

The visualizer sidebar showing Explorer, Details, and Connections tabs

The sidebar provides detailed information through several tabs:

Tab	Contents
Explorer	Hierarchical list of all tasks and sub-tasks; use it to quickly navigate complex designs.
Cflags	The compiler flags passed when building the graph.
Details	Comprehensive information about the currently selected element: task properties, parameters, and connectivity.
Connections	All connections and neighboring tasks for the selected element; useful for tracing data flows.
Options	Additional visualization settings: layout algorithm, task expansion options, and connection port visibility.

Validation

The visualizer is working correctly when:

The graph renders with nodes and edges visible after loading graph.json.
Clicking a node or edge populates the Details tab in the sidebar.
Double-clicking a combo expands or collapses its contents.

Browser compatibility

Category	Browsers
Fully supported	Chrome, Edge, and other Chromium-based browsers; Firefox and Firefox-based browsers
Partially supported	Safari and other WebKit-based browsers (should work but not extensively tested)
Unsupported	Internet Explorer and browsers not updated within the past 12 months

Warning

Using a modern, up-to-date browser is essential for both TAPA Visualizer compatibility and general web security.

If something goes wrong

Warning

If the graph fails to load or renders blank, check that graph.json was produced by tapa compile and is not empty. See Common Errors for further diagnosis.

Next step: Performance Tuning

Performance Tuning

Purpose: Identify and fix throughput bottlenecks in your TAPA design.

When to use this: When your design builds and runs correctly but measured throughput is below your target — for example, the kernel time is higher than expected or resource utilization is unexpectedly high.

What you need

A compiled .xo from tapa compile --work-dir work.out
Reports in work.out/ (synthesis reports, utilization data)
Understanding of your design's expected throughput

Prioritized checklist

Work through these checks in order — each is faster to fix than the next.

1. Check initiation interval (II) in synthesis reports

After tapa compile, check the HLS reports in work.out/ for II violations:

An II > 1 on a pipelined loop means the loop is not fully pipelined and throughput is reduced.
Look for WARNING: [HLS ...] Unable to schedule or II = N where N > 1 in the HLS log.

Fix: Add #pragma HLS pipeline II=1, or restructure the loop body to eliminate the loop-carried dependencies or resource conflicts that prevent II=1.
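To judge how much an II violation costs, the usual first-order estimate for a pipelined loop is (trip count - 1) × II + iteration latency. The sketch below encodes that estimate; `PipelinedLoopCycles` is an illustrative name, and the iteration latency passed in is an assumed value that you would read from the HLS report.

```cpp
#include <cassert>
#include <cstdint>

// First-order cycle estimate for a pipelined loop:
//   cycles = (trip_count - 1) * ii + iteration_latency
// Illustrative helper only; it ignores stalls caused by memory or streams.
uint64_t PipelinedLoopCycles(uint64_t trip_count, uint64_t ii,
                             uint64_t iteration_latency) {
  assert(trip_count > 0 && ii > 0 && iteration_latency > 0);
  return (trip_count - 1) * ii + iteration_latency;
}
```

For a 1024-iteration loop with an assumed iteration latency of 10 cycles, the estimate is 1033 cycles at II=1 and 2056 cycles at II=2, so doubling the II roughly halves throughput.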

2. Check memory throughput — consider `async_mmap`

Synchronous mmap accesses stall the task until each memory transaction completes. If your task spends time waiting for DRAM:

Use tapa::async_mmap to overlap computation and memory access.
Check the synthesis report for memory interface utilization.

3. Check stream depths — FIFOs too shallow?

FIFOs that are too shallow cause backpressure and reduce throughput when producer and consumer tasks run at different rates. If tasks are frequently stalling:

Increase the stream depth in your TAPA source: tapa::stream<T, DEPTH>.
Check waveforms from fast cosim (-xsim_save_waveform) to observe backpressure.
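The effect of depth can be explored before touching the design. The following is a hypothetical cycle-level model, not TAPA output: a producer that emits bursts of four elements and then pauses for four cycles feeds a consumer that accepts one element every other cycle. The burst shape, the rates, and the name `CyclesToDrain` are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a bursty producer and a steady consumer connected by
// a FIFO of the given depth. Returns the cycles needed to consume n elements.
uint64_t CyclesToDrain(uint64_t n, uint64_t depth) {
  assert(depth > 0);
  const uint64_t kBurst = 4, kGap = 4;
  uint64_t occupancy = 0, produced = 0, consumed = 0, cycle = 0;
  uint64_t burst_left = kBurst, gap_left = 0;
  while (consumed < n) {
    // Consumer side: pops on even cycles when data is available.
    if (cycle % 2 == 0 && occupancy > 0) {
      --occupancy;
      ++consumed;
    }
    // Producer side: idle during the gap, otherwise push if there is room.
    if (gap_left > 0) {
      if (--gap_left == 0) burst_left = kBurst;
    } else if (produced < n && occupancy < depth) {
      ++occupancy;
      ++produced;
      if (--burst_left == 0) gap_left = kGap;
    }
    // A full FIFO matches neither branch: the producer stalls this cycle.
    ++cycle;
  }
  return cycle;
}
```

In this model the average rates match, yet a depth-1 FIFO still takes longer than a depth-8 FIFO because the producer stalls mid-burst. Once the FIFO is deep enough to absorb a whole burst, adding more depth changes nothing.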

4. Find resource hotspots with `--enable-synth-util`

Run synthesis with utilization reporting enabled:

tapa --work-dir work.out synth \
  --enable-synth-util \
  --part-num xcu280-fsvh2892-2L-e \
  --clock-period 3.33

TAPA runs an additional RTL synthesis pass and writes per-task resource counts to:

work.out/report.json — machine-readable JSON
work.out/report.yaml — human-readable YAML

Both files contain per-task LUT, FF, BRAM, and DSP counts. Use them to identify which tasks are consuming the most resources before proceeding to full implementation.

Validation

After running tapa synth --enable-synth-util, confirm the reports were written:

ls work.out/report.json work.out/report.yaml

work.out/report.json — machine-readable per-task resource counts (LUT, FF, BRAM, DSP)
work.out/report.yaml — human-readable version of the same data

If these files are missing, synthesis either did not run or exited before the reporting step. Check the HLS log in work.out/ for errors.

Advanced synthesis flags

Controlling FIFO pipelining for floorplanning

By default, TAPA inserts pipeline registers into stream FIFOs to improve timing. When grouping FIFOs with their adjacent logic inside a single floorplan region, suppress pipelining for specific FIFOs:

tapa synth --nonpipeline-fifos fifos.json ...

fifos.json lists the FIFO names to suppress:

["fifo_a", "fifo_b"]

After synthesis, TAPA writes grouping_constraints.json to the work directory. Pass this file to RapidStream or other floorplanning tools.

AutoBridge graph generation

Generate an ab_graph.json for AutoBridge/RapidStream partition-based floorplanning:

tapa synth \
  --gen-ab-graph \
  --floorplan-config floorplan.json \
  ...

--floorplan-config is required when --gen-ab-graph is used. It specifies the target device floorplan regions.

GraphIR generation

Produce a GraphIR representation for RapidStream:

tapa synth \
  --gen-graphir \
  --device-config device.json \
  --floorplan-path floorplan.json \
  ...

Both --device-config and --floorplan-path are required:

Flag	Description
`--device-config PATH`	JSON file describing the physical device (SLR layout, DSP column positions, etc.)
`--floorplan-path PATH`	Floorplan assignment file applied to the program before GraphIR is emitted

The output is work.out/graphir.json, suitable for consumption by RapidStream.

Advanced flags summary

Flag	Description
`--enable-synth-util`	Run post-HLS RTL synthesis to collect per-task resource utilization.
`--disable-synth-util`	Do not run post-HLS RTL synthesis (default).
`--nonpipeline-fifos <json>`	Suppress pipeline registers for listed FIFOs; write `grouping_constraints.json`.
`--gen-ab-graph`	Generate `ab_graph.json` for AutoBridge/RapidStream floorplanning. Requires `--floorplan-config`.
`--floorplan-config PATH`	Device floorplan region description. Required with `--gen-ab-graph`.
`--gen-graphir`	Generate `graphir.json` for RapidStream. Requires `--device-config` and `--floorplan-path`.
`--device-config PATH`	Physical device description for GraphIR conversion. Required with `--gen-graphir`.
`--floorplan-path PATH`	Floorplan assignment applied before GraphIR emission. Required with `--gen-graphir`.
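The "Requires" notes in the table can be checked mechanically before launching a long synthesis run. This is a small sketch of such a pre-flight check; `MissingRequiredFlags` is a hypothetical helper, not part of TAPA, and it encodes only the three dependencies listed in the table.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the flags that the given flag set requires but does not contain,
// following the "Requires" notes in the summary table.
// Hypothetical helper; TAPA performs its own argument validation.
std::vector<std::string> MissingRequiredFlags(
    const std::set<std::string>& flags) {
  std::vector<std::string> missing;
  auto require = [&](const std::string& trigger, const std::string& needed) {
    if (flags.count(trigger) != 0 && flags.count(needed) == 0) {
      missing.push_back(needed);
    }
  };
  require("--gen-ab-graph", "--floorplan-config");
  require("--gen-graphir", "--device-config");
  require("--gen-graphir", "--floorplan-path");
  return missing;
}
```

An empty result means the dependency rules in the table are satisfied; it says nothing about whether the referenced files exist or are valid.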

If something goes wrong

Warning

See Common Errors for help with synthesis failures, II violation messages, and resource overflows.

Next step: Learning Path

Learning Path

These labs walk through the TAPA programming model from first principles to advanced topics. Each lab builds on the previous one — you will understand each concept more deeply if you complete them in order. Allow roughly four hours to work through all six labs.

Labs

Lab	Topic	Prerequisites	Time	Skip if...
Lab 1: Vector Add	Core programming model	Your First Run	20 min	You already understand task graphs and mmap
Lab 2: High-Bandwidth Memory	async_mmap for memory throughput	Lab 1	30 min	You only need basic mmap
Lab 3: Migrating from Vitis HLS	Porting existing HLS code	Lab 1	30 min	You are new to FPGA HLS
Lab 4: Custom RTL Modules	Integrating hand-written RTL	Lab 1	45 min	You don't need to integrate RTL
Lab 5: Parallel RTL Emulation	Multi-kernel concurrent cosimulation	Lab 1, Fast Hardware Simulation	30 min	Your design is a single kernel
Lab 6: Floorplan & DSE	Floorplanning for multi-SLR FPGAs	Lab 2	60 min	You are not targeting multi-SLR devices

Where to start

New to FPGA HLS — Start at Lab 1. It introduces the task graph model that every later lab assumes you understand.

Coming from Vitis HLS — Lab 3 covers the mechanical differences, but reading Lab 1 first is worthwhile because TAPA's concurrency model is structurally different from standard HLS. If you have already read the Programming Model page, you can go directly to Lab 3.

Already ran vadd in First Run — You have seen the commands; Lab 1 does the deep-dive explanation of why the code is structured the way it is. It is worth reading even if the output was correct.

Need HBM throughput — Work through Lab 2 (async_mmap) and then Lab 6 (floorplanning). Both are required to get full memory bandwidth on multi-SLR devices.

Building a multi-kernel pipeline — Lab 5 covers parallel RTL emulation, which lets you validate inter-kernel dataflow at RTL level before the bitstream link step.

Background reading

Before starting any lab, the Programming Model page covers the vocabulary used throughout: task graphs, streams, mmap, and the compile pipeline. The labs assume you have read at least the Programming Model page.

Start here: Lab 1: Vector Add

Lab 1: Vector Add

Goal: Understand why the VecAdd design is structured as four concurrent tasks connected by streams, and what each structural choice means for hardware generation.

Prerequisites: Complete Your First Run so that you have already built and run the vadd example. This lab explains what you ran — it does not repeat the run commands.

After this lab you will understand:

How a top-level task orchestrates leaf tasks without containing computation
How mmap and stream arguments express data movement
How the host invocation connects host memory to the hardware kernel

Design overview

VecAdd computes c[i] = a[i] + b[i] for n elements. The implementation is a four-task pipeline:

Mmap2Stream(a) ──► a_q ──►
                           Add ──► c_q ──► Stream2Mmap(c)
Mmap2Stream(b) ──► b_q ──►

This is a producer-pipeline-consumer pattern. The two Mmap2Stream tasks read from global memory and feed elements into streams. Add consumes both streams and produces a result stream. Stream2Mmap drains the result stream back to global memory. All four tasks run concurrently once VecAdd is invoked — there is no sequencing between them.

The reason for this decomposition is not code style. TAPA generates separate hardware modules for each task, and the streams between them become FIFOs on the FPGA. When each stage is continuously supplied with data, the pipeline can run at full throughput.

`Mmap2Stream`

void Mmap2Stream(tapa::mmap<const float> mmap, uint64_t n,
                 tapa::ostream<float>& stream) {
  for (uint64_t i = 0; i < n; ++i) {
    stream << mmap[i];
  }
}

tapa::mmap<const float> is passed by value, not by reference. This is a hard rule in TAPA: mmap arguments to leaf tasks must be passed by value. The const qualifier marks the memory as read-only, which causes the compiler to generate a read-only AXI master port during synthesis. See mmap for details.

Inside the loop, mmap[i] is array-style access to global memory. Each access becomes an AXI read transaction. The << operator writes the element to the output stream, blocking if the FIFO is full. HLS can pipeline this loop at II=1 when the memory access latency is hidden by the pipeline depth.

`Add`

void Add(tapa::istream<float>& a, tapa::istream<float>& b,
         tapa::ostream<float>& c, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    c << (a.read() + b.read());
  }
}

Stream arguments are passed by reference. This is the mirror of the mmap rule: streams must be by reference, mmap must be by value. See Tasks for a full explanation.

a.read() blocks until an element is available in the FIFO. This is safe here because the loop runs exactly n times, and Mmap2Stream feeds exactly n elements into each stream. There is no risk of deadlock as long as the element counts match.

The << on the output stream blocks if the downstream FIFO (c_q) is full. That backpressure propagates through the pipeline: Add stalls, which causes a_q and b_q to fill, which eventually stalls both Mmap2Stream tasks. The pipeline self-regulates without any explicit flow control logic.

HLS can pipeline this loop at II=1 because the operations in each iteration (two stream reads, one add, and one stream write) do not depend on results from earlier iterations.

`Stream2Mmap`

void Stream2Mmap(tapa::istream<float>& stream, tapa::mmap<float> mmap,
                 uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    stream >> mmap[i];
  }
}

This is the mirror of Mmap2Stream. The >> operator reads one element from the stream (blocking) and writes it to global memory. The mmap is non-const this time because the output buffer is writable.

The same structural rules apply: mmap by value (non-const for write access), stream by reference.

`VecAdd` — the top-level task

void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float> a_q("a");
  tapa::stream<float> b_q("b");
  tapa::stream<float> c_q("c");

  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}

VecAdd contains no computation — no arithmetic, no memory access, no loops. This is deliberate. Upper-level tasks in TAPA are orchestration-only: they declare streams, then launch child tasks. Putting computation in an upper-level task is not supported.

The tapa::stream<float> declarations create named FIFOs. The string names ("a", "b", "c") are used by TAPA's debug infrastructure: setting TAPA_STREAM_LOG_DIR causes TAPA to log every element transferred through each named stream, which is useful when tracking down data corruption.

The .invoke() chain starts all four child tasks simultaneously. TAPA does not sequence them — there is no "run Mmap2Stream first, then Add". All four tasks are live from the moment VecAdd is invoked, and they communicate entirely through the stream FIFOs. The task graph is what determines data ordering, not the order of .invoke() calls.
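The claim that invocation order does not determine data ordering can be checked with a small software model. The sketch below does not use the TAPA library: it is a single-threaded stand-in in which each task is a step function that performs at most one non-blocking transfer, and a round-robin loop plays the role of concurrent execution. `Fifo`, `ModelVecAdd`, and the `order` parameter are illustrative names; the depth of 2 mirrors the default depth mentioned elsewhere in this documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Bounded FIFO standing in for a tapa::stream (assumed depth: 2).
struct Fifo {
  std::deque<float> q;
  std::size_t depth = 2;
  bool full() const { return q.size() >= depth; }
  bool empty() const { return q.empty(); }
};

// Software model of the VecAdd task graph. `order` must be a permutation of
// {0, 1, 2, 3}; it sets the order in which the four tasks are stepped.
std::vector<float> ModelVecAdd(const std::vector<float>& a,
                               const std::vector<float>& b,
                               const std::vector<int>& order) {
  assert(a.size() == b.size() && order.size() == 4);
  const std::size_t n = a.size();
  std::vector<float> c(n);
  Fifo a_q, b_q, c_q;
  std::size_t ia = 0, ib = 0, iadd = 0, ic = 0;

  // Each step attempts one transfer and returns immediately if it cannot.
  const std::vector<std::function<void()>> steps = {
      [&] { if (ia < n && !a_q.full()) a_q.q.push_back(a[ia++]); },  // Mmap2Stream(a)
      [&] { if (ib < n && !b_q.full()) b_q.q.push_back(b[ib++]); },  // Mmap2Stream(b)
      [&] {                                                          // Add
        if (iadd < n && !a_q.empty() && !b_q.empty() && !c_q.full()) {
          c_q.q.push_back(a_q.q.front() + b_q.q.front());
          a_q.q.pop_front();
          b_q.q.pop_front();
          ++iadd;
        }
      },
      [&] {                                                          // Stream2Mmap
        if (ic < n && !c_q.empty()) {
          c[ic++] = c_q.q.front();
          c_q.q.pop_front();
        }
      },
  };

  while (ic < n) {
    for (int task : order) steps[task]();
  }
  return c;
}
```

Stepping the tasks as `{0, 1, 2, 3}` or as `{3, 2, 0, 1}` produces the same output vector, because the FIFOs alone decide when each element moves.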

For a full description of the task graph model, see The Programming Model.

Note

The .invoke() chain is syntactic sugar for constructing a tapa::task object and calling .invoke() on it repeatedly. Each call returns the same task object, which is why chaining works. Because the task object is a temporary, it is destroyed at the end of the statement that creates it (here, the last statement in VecAdd), and its destruction causes TAPA to wait for all child tasks to finish before VecAdd returns.

Host code

int64_t kernel_time_ns = tapa::invoke(
    VecAdd, FLAGS_bitstream,
    tapa::read_only_mmap<const float>(a),
    tapa::read_only_mmap<const float>(b),
    tapa::write_only_mmap<float>(c), n);

tapa::invoke is the host-side entry point. It is not the same as calling VecAdd() directly: calling VecAdd() would run it as a plain C++ function (software simulation without timing), while tapa::invoke selects the execution mode based on the bitstream path:

Empty string ("") — software simulation. TAPA runs VecAdd as C++ but with stream and mmap semantics enforced by the runtime library. Fast, no FPGA required.
.xo file — fast cosimulation. The synthesized RTL runs inside a cycle-accurate simulator. Useful for verifying timing-sensitive behavior.
.xclbin file — hardware execution on a real FPGA.
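The three cases above amount to a classification of the bitstream path. The sketch below restates that mapping as code so it can be reused in host-side logging or argument checks; `ExecMode` and `ClassifyBitstream` are hypothetical names, and this is not how tapa::invoke is implemented internally.

```cpp
#include <string>

// Execution modes described above, plus a catch-all for other paths.
enum class ExecMode { kSoftwareSim, kFastCosim, kHardware, kUnknown };

// Returns true if `s` ends with `suffix`.
static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper mirroring the documented mapping from bitstream path
// to execution mode. Behavior for other extensions is not documented here,
// so it is reported as kUnknown.
ExecMode ClassifyBitstream(const std::string& path) {
  if (path.empty()) return ExecMode::kSoftwareSim;
  if (EndsWith(path, ".xclbin")) return ExecMode::kHardware;
  if (EndsWith(path, ".xo")) return ExecMode::kFastCosim;
  return ExecMode::kUnknown;
}
```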

tapa::read_only_mmap<const float>(a) wraps the host vector a and tells the runtime to transfer it to the FPGA as a read-only buffer. tapa::write_only_mmap<float>(c) marks c as write-only, so the runtime transfers results back after the kernel finishes. These are directives to the runtime about transfer direction — they do not add C++ access restrictions beyond what the type already expresses.

For the actual build and run commands, see Your First Run.

Rules summary

Leaf task arguments: streams by reference (tapa::istream<T>&, tapa::ostream<T>&), mmap by value (tapa::mmap<T>)
Upper-level tasks: declare streams with tapa::stream<T>, invoke child tasks with .invoke(), contain no computation
Stream names (the string argument to tapa::stream<T>) are used by the debug infrastructure and appear in error messages — always name your streams
mmap const-ness (const float vs float) determines whether the synthesized AXI master port is read-only or read-write; transfer direction at runtime is set separately by read_only_mmap/write_only_mmap on the host side

Tip

If you see a compilation error about streams being passed by value or mmap being passed by reference, check your task signatures. TAPA enforces these argument-passing conventions at compile time.

Next step: Lab 2: High-Bandwidth Memory

Lab 2: High-Bandwidth Memory with async_mmap

Goal: Achieve high DRAM throughput by overlapping multiple outstanding memory requests using async_mmap.

Prerequisites: Lab 1: Vector Add and Memory Access: async_mmap

After this lab you will understand:

Why sequential memory access wastes most of the available DRAM bandwidth
How the two-counter loop pattern keeps multiple requests in flight simultaneously
How to correctly coordinate the three write channels and drain write_resp

The problem: one request at a time

With a plain mmap<T> argument, each read or write is a blocking operation. The loop below looks innocuous, but every iteration stalls waiting for data to return from DRAM before the next address is issued:

// Problematic: one outstanding request at a time
for (int i = 0; i < n; i++) {
  result[i] = mem[i];  // blocks until data returns
}

Off-chip DRAM latency is typically 100–200 ns. At a 300 MHz clock that is 30–60 idle cycles per element. For sequential access patterns the HLS tool's burst inference may help, but for random-access patterns or when you need explicit control over request depth, mmap leaves most of the available bandwidth unused.
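The cycle figures follow from multiplying latency by clock frequency. As a quick check of the arithmetic, here is a small helper; `LatencyCycles` is an illustrative name, and integer division rounds down.

```cpp
#include <cassert>
#include <cstdint>

// Converts a latency in nanoseconds to clock cycles at the given frequency:
//   cycles = latency_ns * clock_mhz / 1000
uint64_t LatencyCycles(uint64_t latency_ns, uint64_t clock_mhz) {
  assert(clock_mhz > 0);
  return latency_ns * clock_mhz / 1000;
}
```

At 300 MHz, 100 ns corresponds to 30 cycles and 200 ns to 60 cycles, matching the range quoted above.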

async_mmap solves this by exposing the five AXI channels directly as streams. You can issue many read addresses before any data returns, keeping dozens of requests in flight and hiding the per-request latency behind the steady flow of data. See Memory Access: async_mmap for the channel layout and area comparison.

Example 1: Overlapping reads with a single loop

The idiomatic TAPA read pattern uses two counters in a single pipelined loop:

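void ReadKernel(tapa::async_mmap<float>& mem, float* result, uint64_t n) {
  // i_req counts addresses issued; i_resp counts responses received.
  for (int64_t i_req = 0, i_resp = 0; i_resp < (int64_t)n;) {
#pragma HLS pipeline II=1
    // Issue the next address only if the address channel accepts it.
    if (i_req < (int64_t)n && mem.read_addr.try_write(i_req)) ++i_req;
    // Collect one response if data has arrived; otherwise retry next cycle.
    float val;
    if (mem.read_data.try_read(val)) {
      result[i_resp] = val;
      ++i_resp;
    }
  }
}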

How it works:

i_req tracks how many addresses have been issued; i_resp tracks how many responses have been received.
The loop condition is i_resp < n: it runs until every response is collected, not just until every address is sent.
mem.read_addr.try_write(i_req) is non-blocking. If the address channel is full this cycle, it returns false and the address is retried on the next cycle. i_req only advances when the write succeeds.
mem.read_data.try_read(val) is non-blocking. If no data has arrived yet, it returns false and the loop continues without blocking.
Because both branches are independent and non-blocking, the loop can issue a new address and receive a response in the same clock cycle.
The difference i_req - i_resp is the current number of in-flight requests. The hardware limits this to the channel depth; TAPA coalesces sequential addresses into AXI bursts automatically at runtime, so you never need to write explicit burst logic.
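The benefit of keeping requests in flight can be estimated with a simple cycle model. The sketch below is a hypothetical model, not TAPA code: it assumes every response arrives a fixed `latency` cycles after its address is accepted, that one address and one response can be handled per cycle, and that `max_in_flight` stands in for the channel depth. Setting `max_in_flight` to 1 approximates the blocking pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Cycles needed to read n elements under the assumptions stated above.
uint64_t ReadCycles(uint64_t n, uint64_t latency, uint64_t max_in_flight) {
  assert(latency > 0 && max_in_flight > 0);
  std::deque<uint64_t> ready_at;  // arrival cycle of each in-flight response
  uint64_t i_req = 0, i_resp = 0, cycle = 0;
  while (i_resp < n) {
    // Issue side: like read_addr.try_write, succeeds only when there is room.
    if (i_req < n && ready_at.size() < max_in_flight) {
      ready_at.push_back(cycle + latency);
      ++i_req;
    }
    // Response side: like read_data.try_read, succeeds only once data is back.
    if (!ready_at.empty() && ready_at.front() <= cycle) {
      ready_at.pop_front();
      ++i_resp;
    }
    ++cycle;
  }
  return cycle;
}
```

In this model, `ReadCycles(4, 3, 1)` takes 16 cycles while `ReadCycles(4, 3, 8)` takes 7. With enough requests in flight the total approaches n plus one latency, rather than n times the latency.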

Example 2: Sequential writes with burst detection

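Writes require coordinating three channels: write_addr, write_data, and write_resp. The pattern commits a write only when the input stream has data and both write_addr and write_data have room, and it drains write_resp independently in the same loop: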

void WriteKernel(tapa::async_mmap<float>& mem,
                 tapa::istream<float>& data, uint64_t n) {
  // i_req counts writes issued; i_resp counts writes acknowledged.
  for (int64_t i_req = 0, i_resp = 0; i_resp < (int64_t)n;) {
#pragma HLS pipeline II=1
    // Commit a write only when input data and both channels are ready.
    if (i_req < (int64_t)n && !data.empty() &&
        !mem.write_addr.full() && !mem.write_data.full()) {
      mem.write_addr.try_write(i_req);
      mem.write_data.try_write(data.read(nullptr));
      ++i_req;
    }
    // Always drain write_resp, even if the count is not needed.
    uint8_t ack;
    if (mem.write_resp.try_read(ack)) {
      i_resp += unsigned(ack) + 1;  // ack encodes burst length - 1
    }
  }
}

Key points:

Before issuing a write, all three preconditions must hold: the input stream must have data, and neither the address nor the data channel may be full. Checking them together prevents partial commits.
write_resp must be consumed even if you do not use the count. The hardware stops accepting new write addresses once the write_resp FIFO fills up, causing deadlock if the kernel never drains it.
The ack value encodes burst_length - 1. TAPA detects that you are issuing sequential addresses and merges them into AXI bursts at runtime. A single write_resp entry can therefore acknowledge many writes, which is why i_resp += unsigned(ack) + 1 rather than i_resp += 1.
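The response accounting can be checked in isolation. The helper below sums a sequence of write_resp values using the burst_length - 1 encoding described above; `AcknowledgedWrites` is an illustrative name and the sample values in the note that follows are made up.

```cpp
#include <cstdint>
#include <vector>

// Total number of writes acknowledged by a sequence of write_resp values,
// where each value encodes (burst length - 1).
uint64_t AcknowledgedWrites(const std::vector<uint8_t>& acks) {
  uint64_t total = 0;
  for (uint8_t ack : acks) {
    total += static_cast<uint64_t>(ack) + 1;
  }
  return total;
}
```

For example, responses of 15, 15, and 3 acknowledge 16 + 16 + 4 = 36 writes, whereas counting one write per response would report only 3 and the loop would never reach n.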

Rules for using async_mmap

Pass async_mmap<T> by reference (async_mmap<T>&). Passing by value is an error.
Inside pipelined loops, use only the non-blocking try_read/try_write. Blocking read/write stalls the pipeline and will cause deadlock when combined with other non-blocking channels.
Always drain write_resp, even if you discard the burst-length value.
An mmap<T> argument can be passed to an async_mmap<T>& parameter in a child task without changing the caller.

Warning

Never use blocking read/write on async_mmap channels inside a pipelined loop. Because the five AXI channels are decoupled, blocking on one channel prevents progress on the others and causes the kernel to hang.

Tip

For the full API reference and the area comparison table showing how async_mmap compares to the Vitis HLS m_axi interface, see Memory Access: async_mmap.

Next step: Lab 3: Migrating from Vitis HLS

Lab 3: Migrating from Vitis HLS

Goal: Port an existing Vitis HLS kernel to TAPA by replacing HLS-specific constructs with their TAPA equivalents.

Prerequisites: Lab 1: Vector Add and familiarity with the TAPA task model.

After this lab you will understand:

The mechanical substitutions that cover most Vitis HLS kernels
Why the dataflow-in-a-loop pattern must be restructured in TAPA
How tapa::hls::stream supports incremental migration of large codebases

Quick reference: Vitis HLS → TAPA

Vitis HLS	TAPA	Notes
`#include <hls_stream.h>`	`#include <tapa.h>`	TAPA includes its own stream types
`T* port` + `#pragma HLS INTERFACE m_axi`	`tapa::mmap<T> port` (by value)	Remove all `m_axi` pragmas
`hls::stream<T>&`	`tapa::istream<T>&` or `tapa::ostream<T>&`	Direction is explicit in TAPA
`#pragma HLS dataflow` + direct calls	`tapa::task().invoke(...)`	Tasks run concurrently
Top function contains computation	Move computation into child tasks	TAPA upper-level tasks are orchestration-only
`hls::stream<T>` local variable	`tapa::stream<T>` local variable	Same syntax; depth is enforced during software simulation (default depth: 2)

Example 1: Basic VecAdd migration

The full before and after files are at example_1_before.cpp and example_1_after.cpp.

Step 1: Replace the include

-#include <hls_stream.h>
-#include <hls_vector.h>
+#include <hls_vector.h>
+#include <tapa.h>

TAPA provides its own stream types, so hls_stream.h is no longer needed. Other HLS headers such as ap_int.h and hls_vector.h are still supported and can be included as usual.

Step 2: Replace pointer arguments with `tapa::mmap<T>`

Vitis HLS uses raw pointers annotated with #pragma HLS INTERFACE m_axi to indicate off-chip memory. TAPA replaces this with tapa::mmap<T> passed by value, and no pragma is needed:

-void load_input(hls::vector<uint32_t, NUM_WORDS>* in,
+void load_input(tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in,

-  hls::vector<uint32_t, NUM_WORDS>* in1,
-  hls::vector<uint32_t, NUM_WORDS>* in2,
-  hls::vector<uint32_t, NUM_WORDS>* out, int size) {
-#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
-#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
-#pragma HLS INTERFACE m_axi port = out bundle = gmem0
+  tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in1,
+  tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in2,
+  tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> out, int size) {

tapa::mmap<T> supports element-indexed reads and writes (mem[i]) just like a pointer, so the body of each task usually does not need to change.

Step 3: Replace `hls::stream<T>&` with directional TAPA streams

In Vitis HLS, hls::stream<T>& is direction-agnostic: the same type is used whether the stream is read or written. TAPA makes direction explicit:

-void compute_add(hls::stream<hls::vector<uint32_t, NUM_WORDS>>& in1_stream,
-                 hls::stream<hls::vector<uint32_t, NUM_WORDS>>& in2_stream,
-                 hls::stream<hls::vector<uint32_t, NUM_WORDS>>& out_stream,
+void compute_add(tapa::istream<hls::vector<uint32_t, NUM_WORDS>>& in1_stream,
+                 tapa::istream<hls::vector<uint32_t, NUM_WORDS>>& in2_stream,
+                 tapa::ostream<hls::vector<uint32_t, NUM_WORDS>>& out_stream,

Use tapa::istream<T>& for streams the task reads from, and tapa::ostream<T>& for streams the task writes to. The read() method and the << operator work the same as in Vitis HLS.

Step 4: Replace local `hls::stream<T>` declarations

Local streams declared inside the top-level function become tapa::stream<T>:

-  hls::stream<hls::vector<uint32_t, NUM_WORDS>> in1_stream("input_stream_1");
-  hls::stream<hls::vector<uint32_t, NUM_WORDS>> in2_stream("input_stream_2");
-  hls::stream<hls::vector<uint32_t, NUM_WORDS>> out_stream("output_stream");
+  tapa::stream<hls::vector<uint32_t, NUM_WORDS>> in1_stream("input_stream_1");
+  tapa::stream<hls::vector<uint32_t, NUM_WORDS>> in2_stream("input_stream_2");
+  tapa::stream<hls::vector<uint32_t, NUM_WORDS>> out_stream("output_stream");

tapa::stream<T> accepts a name string for the same debugging purpose as hls::stream<T>. To set a custom depth, use tapa::stream<T, DEPTH>. For stream arrays, use tapa::streams<T, ARRAY_SIZE, DEPTH>.

Note

The default stream depth in TAPA is 2, matching the Vitis HLS default. Unlike Vitis HLS, TAPA enforces the depth during software simulation, which helps catch backpressure bugs before synthesis.

Step 5: Replace `#pragma HLS dataflow` with `tapa::task().invoke(...)`

Vitis HLS uses #pragma HLS dataflow to signal that a sequence of direct function calls should run as concurrent processes. TAPA replaces this with an explicit task graph:

-#pragma HLS dataflow
-  load_input(in1, in1_stream, size);
-  load_input(in2, in2_stream, size);
-  compute_add(in1_stream, in2_stream, out_stream, size);
-  store_result(out, out_stream, size);
+  tapa::task()
+      .invoke(load_input, in1, in1_stream, size)
+      .invoke(load_input, in2, in2_stream, size)
+      .invoke(compute_add, in1_stream, in2_stream, out_stream, size)
+      .invoke(store_result, out, out_stream, size);

All tasks in a tapa::task().invoke(...) chain run concurrently. The top-level function becomes pure orchestration — it declares streams, then hands everything off to child tasks.

Example 2: Dataflow-in-a-loop

The full before and after files are at example_2_before.cpp and example_2_after.cpp.

Vitis HLS permits #pragma HLS dataflow inside a for loop. Each iteration starts a new concurrent dataflow region:

// Vitis HLS: dataflow region restarts each iteration
size /= NUM_WORDS;
for (int i = 0; i < size; i++) {
#pragma HLS dataflow
  load_input(in1, in1_stream, i);
  load_input(in2, in2_stream, i);
  compute_add(in1_stream, in2_stream, out_stream);
  store_result(out, out_stream, i);
}

TAPA does not allow computation in upper-level tasks. An upper-level TAPA task may only declare streams and invoke child tasks; it cannot contain loops or arithmetic. The solution is to move the loop into each child task:

// TAPA: loop lives in the child tasks
void load_input(tapa::mmap<hls::vector<uint32_t, NUM_WORDS>> in,
                tapa::ostream<hls::vector<uint32_t, NUM_WORDS>>& inStream,
                int size) {
  size /= NUM_WORDS;
  for (int i = 0; i < size; i++) {
#pragma HLS pipeline II = 1
    inStream << in[i];
  }
}

The top-level task then becomes:

void vadd(...) {
  tapa::stream<...> in1_stream(...);
  tapa::stream<...> in2_stream(...);
  tapa::stream<...> out_stream(...);

  tapa::task()
      .invoke(load_input, in1, in1_stream, size)
      .invoke(load_input, in2, in2_stream, size)
      .invoke(compute_add, in1_stream, in2_stream, out_stream, size)
      .invoke(store_result, out, out_stream, size);
}

The child tasks stream data to each other for the full duration; no synchronization is needed between iterations because each task has its own loop that runs from start to finish.
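To see why hoisting the loop preserves results, the following is a minimal plain C++ model rather than TAPA code. std::queue stands in for a stream, the function names (AddLoopOutside, AddLoopInside, Load, Add, Store) are hypothetical, and both inputs are assumed to have the same length. It compares the loop-around-all-stages shape with the loop-inside-each-stage shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Plain C++ model, not TAPA code: std::queue stands in for a stream.
using Fifo = std::queue<uint32_t>;

// Vitis HLS shape: the loop wraps all stages, one element per iteration.
std::vector<uint32_t> AddLoopOutside(const std::vector<uint32_t>& a,
                                     const std::vector<uint32_t>& b) {
  std::vector<uint32_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    Fifo a_q, b_q, out_q;
    a_q.push(a[i]);                         // load_input
    b_q.push(b[i]);                         // load_input
    out_q.push(a_q.front() + b_q.front());  // compute_add
    out[i] = out_q.front();                 // store_result
  }
  return out;
}

// TAPA shape: each stage owns its loop and runs from start to finish.
void Load(const std::vector<uint32_t>& in, Fifo& q) {
  for (uint32_t v : in) q.push(v);
}

void Add(Fifo& a_q, Fifo& b_q, Fifo& out_q, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out_q.push(a_q.front() + b_q.front());
    a_q.pop();
    b_q.pop();
  }
}

void Store(Fifo& out_q, std::vector<uint32_t>& out) {
  for (auto& v : out) {
    v = out_q.front();
    out_q.pop();
  }
}

std::vector<uint32_t> AddLoopInside(const std::vector<uint32_t>& a,
                                    const std::vector<uint32_t>& b) {
  Fifo a_q, b_q, out_q;
  std::vector<uint32_t> out(a.size());
  // The model runs the stages back to back over unbounded queues;
  // real TAPA tasks run concurrently over bounded streams.
  Load(a, a_q);
  Load(b, b_q);
  Add(a_q, b_q, out_q, a.size());
  Store(out_q, out);
  return out;
}
```

Calling both functions on the same inputs returns identical vectors. The model deliberately ignores stream depth, which is what makes the sequential execution safe here.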

HLS-compat helpers for incremental migration

If you have a large existing codebase, TAPA provides tapa::hls::stream<T> as a drop-in replacement for hls::stream<T>. Unlike tapa::stream<T>, it uses effectively infinite depth in software simulation, so producers never block. This lets you keep direction-agnostic stream passing patterns while still running software simulation.

tapa::hls::stream<T> is available via #include <tapa.h> — no additional include is needed.

// Before (Vitis HLS):
hls::stream<float>& s

// After (TAPA compat, passes software simulation without depth tuning):
tapa::hls::stream<float>& s

Use this as a stepping stone: get software simulation passing with tapa::hls::stream, then replace with directional tapa::istream<T>& / tapa::ostream<T>& before shipping.

Note

tapa::hls::stream synthesizes correctly — the generated RTL FIFO is identical to tapa::stream<T, N>. The reason to replace it before hardware build is that the infinite simulation depth hides backpressure bugs. Switching to directional streams with a tuned depth catches those bugs during software simulation, before they appear on hardware.

Next step: Lab 4: Custom RTL Modules

Lab 4: Custom RTL Modules

Goal: Replace a TAPA task with a hand-written RTL module while keeping a C++ behavior model for software simulation.

Prerequisites: Lab 1: Vector Addition and familiarity with the TAPA compile pipeline.

After this lab you will understand how to write a C++ behavior model for an ignored task, label it for RTL replacement, generate RTL port templates, provide custom RTL, and repack into a deployable XO.

When to use this

Use custom RTL modules when:

An existing RTL implementation is available from a vendor IP catalog or a prior design, and reimplementing it in HLS would be wasteful.
A task requires timing, area, or interface characteristics that HLS cannot produce.
A task is too complex to express in synthesizable C++ and a direct RTL description is more practical.

Overview

The workflow has three parts:

Write a C++ behavior model that correctly implements the task — this is what runs during software simulation. The code does not need to be synthesizable.
Wrap the behavior model in a task annotated with [[tapa::target("ignore")]]. TAPA compiles the rest of the design normally and generates RTL port template files for the ignored task instead of synthesizing it.
Provide the actual RTL implementation and repack the XO.

Example: using a vendor floating-point IP

Suppose you have a task that computes element-wise reciprocal square root and want to use Xilinx's Floating-Point IP core rather than the HLS-generated logic.

Step 1: Write the C++ behavior model

The behavior model lives in an ordinary task function. It will be called during software simulation and will never be synthesized, so it can use any C++ — standard library calls, dynamic containers, whatever is convenient and correct.

#include <cmath>
#include <cstdint>
#include <tapa.h>

// Behavior model: runs during software simulation only.
// Uses std::sqrt — this does not need to be synthesizable.
void RsqrtCore(tapa::istream<float>& in, tapa::ostream<float>& out,
               uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) {
    float val = in.read();
    out.write(1.0f / std::sqrt(val));  // stdlib call: fine for simulation
  }
}

Step 2: Wrap with `[[tapa::target("ignore")]]`

Create a thin wrapper that invokes the behavior model. The [[tapa::target("ignore")]] attribute tells TAPA to skip synthesis of this wrapper and generate RTL port templates in its place. During software simulation the wrapper runs normally, which in turn calls RsqrtCore.

[[tapa::target("ignore")]] void Rsqrt(
    tapa::istream<float>& in, tapa::ostream<float>& out, uint64_t n) {
  tapa::task().invoke(RsqrtCore, in, out, n);
}

Note

Only the wrapper needs the attribute. The behavior model (RsqrtCore) is a plain task function. Software simulation runs the wrapper as usual; synthesis skips it and generates port templates.

Step 3: Integrate into the top-level task

void Pipeline(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n) {
  tapa::stream<float> in_q("in");
  tapa::stream<float> out_q("out");

  tapa::task()
      .invoke(Load, in, n, in_q)
      .invoke(Rsqrt, in_q, out_q, n)   // custom RTL replaces this
      .invoke(Store, out_q, out, n);
}

Step 4: Compile to generate template files

tapa compile \
  --top Pipeline \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -f pipeline.cpp \
  -o work.out/pipeline.xo

Because Rsqrt is tagged ignore, TAPA generates RTL template files under work.out/template/. These templates define the exact port signatures the replacement RTL module must match.

Step 5: Implement the RTL

Write or adapt your RTL files so their port declarations match the generated templates. When you run tapa pack --custom-rtl in the next step, TAPA performs advisory port checking on .v files: it warns on mismatches but does not abort the build. Resolve any reported mismatches before moving to hardware.

Step 6: Repack with custom RTL

Two workflows are available depending on whether you are iterating on the RTL separately from the HLS compilation step.

Option A — Two-step workflow (compile once, iterate on RTL separately):

tapa pack \
  -o work.out/pipeline.xo \
  --custom-rtl ./rtl/

Option B — One-step workflow (compile and pack together):

tapa compile \
  --top Pipeline \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -f pipeline.cpp \
  -o work.out/pipeline.xo \
  --custom-rtl ./rtl/

--custom-rtl accepts a file path or a directory. To include multiple paths, repeat the flag. .v files receive advisory port checking; other file types (for example .tcl) are packaged without format checking.

Software simulation with the behavior model

Because the behavior model is plain C++, software simulation works exactly as for any other TAPA design:

tapa g++ -- pipeline.cpp host.cpp -o pipeline
./pipeline

The behavior model does not need to match the RTL cycle-accurately — it only needs to produce the correct output values. Use this to validate host logic and data paths before RTL is ready.

Note

The behavior model code can freely use unsynthesizable constructs: standard library functions, dynamic allocation, floating-point math, file I/O for golden output comparison, and so on. TAPA never attempts to synthesize it.
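Since only the output values matter, a host-side comparison against a golden reference is usually enough to validate the data path. The sketch below is plain C++ with hypothetical helper names (RsqrtGolden, AllClose). The relative tolerance is an assumption, chosen because a hardware floating-point core may round differently from std::sqrt.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helpers; not part of the TAPA API.

// Golden reference for the reciprocal square root example.
std::vector<float> RsqrtGolden(const std::vector<float>& in) {
  std::vector<float> out;
  out.reserve(in.size());
  for (float v : in) out.push_back(1.0f / std::sqrt(v));
  return out;
}

// Returns true when every element of `got` is within `rel_tol`
// (relative error) of the matching element of `want`.
// The negated comparison makes NaN values fail the check.
bool AllClose(const std::vector<float>& got, const std::vector<float>& want,
              float rel_tol = 1e-5f) {
  if (got.size() != want.size()) return false;
  for (std::size_t i = 0; i < got.size(); ++i) {
    const float scale = std::fmax(std::fabs(want[i]), 1e-30f);
    if (!(std::fabs(got[i] - want[i]) <= rel_tol * scale)) return false;
  }
  return true;
}
```

In a host program, the vector read back from the kernel would be passed as `got` and the result of RsqrtGolden as `want`. The same check can be reused unchanged when the custom RTL is later validated in cosim.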

Validation

After repacking, run fast cosim to verify the custom RTL produces correct results before committing to a full bitstream build:

./pipeline --bitstream=work.out/pipeline.xo 1000

Catching functional bugs at cosim time is far cheaper than discovering them after hours of bitstream generation.

Full example

The complete working example is in tests/functional/custom-rtl in the TAPA repository.

Next step: Lab 5: Parallel RTL Emulation

Lab 5: Parallel RTL Emulation

Goal: Compile cycle-sensitive kernel modules to RTL and simulate them concurrently, reducing total cosim time while preserving cycle-accurate behavior where it matters.

Prerequisites: Lab 1: Vector Addition and Fast Hardware Simulation.

After this lab you will understand how to use tapa::executable to assign per-kernel RTL targets, compile each kernel to its own .xo, run the simulations in parallel, and prevent work-directory collisions between concurrent instances.

When to use this

RTL cosimulation gives you cycle-accurate behavior for the logic inside each kernel — pipeline depths, stall conditions, hazards, and II violations that software simulation cannot catch. However, not everything needs this level of fidelity:

FIFOs between kernels are modeled as shared-memory queues, not cycle-accurate RTL. The latency across kernel boundaries is not representative of hardware.
Memory accesses (mmap, async_mmap) are similarly abstracted; memory latency is not cycle-accurate.

Parallel RTL emulation is therefore most valuable for validating the cycle-sensitive internals of each kernel in isolation (compute pipelines, achieved II, stall behavior) rather than end-to-end timing across the full datapath.

Running one cosim process per kernel and launching them concurrently reduces wall-clock time compared to simulating everything in a single process or sequentially. Use it when:

Your design contains multiple kernels with non-trivial compute pipelines that need cycle-accurate validation.
You want to catch pipeline hazards, incorrect II, or RTL-level bugs in each kernel before the expensive bitstream link step.
The kernels can be compiled and simulated independently.

Concept

In a standard single-kernel design, one top-level function compiles to one .xo and one cosim process validates it. In the parallel emulation pattern, several kernel functions compile independently and the host program runs one cosim process per kernel, all concurrently:

tapa::task()
  .invoke(KernelA, tapa::executable(FLAGS_a_bitstream), ...)  ──▶  cosim process A (cycle-accurate)
  .invoke(KernelB, tapa::executable(FLAGS_b_bitstream), ...)  ──▶  cosim process B (cycle-accurate)
  .invoke(KernelC, tapa::executable(FLAGS_c_bitstream), ...)  ──▶  cosim process C (cycle-accurate)

The streams connecting the processes are shared-memory FIFOs managed by the host runtime — latency-insensitive data transfer that lets each cosim process run at its own pace. Each kernel's internal cycle behavior is faithfully simulated; the inter-kernel communication is not.

Step 1: Write the kernels

Each kernel is a plain TAPA task function. The Cannon matrix-multiply example from tests/functional/parallel-emulation/ uses three kernel functions — Scatter, ProcElem, and Gather — all in one source file:

// Distribute matrix rows into per-PE stream arrays
void Scatter(tapa::mmap<const float> matrix,
             tapa::ostreams<float, p * p>& block) { ... }

// Each PE computes its sub-matrix tile
void ProcElem(tapa::istream<float>& a_fifo, tapa::istream<float>& b_fifo,
              tapa::ostream<float>& c_fifo, ...) { ... }

// Collect PE results into the output matrix
void Gather(tapa::mmap<float> matrix,
            tapa::istreams<float, p * p>& block) { ... }

The top-level function declares the shared streams and assembles the task graph using tapa::executable:

DEFINE_string(scatter_bitstream, "", "XO for Scatter; empty = software simulation");
DEFINE_string(proc_elem_bitstream, "", "XO for ProcElem; empty = software simulation");
DEFINE_string(gather_bitstream, "", "XO for Gather; empty = software simulation");

void Cannon(tapa::mmap<const float> a_vec, tapa::mmap<const float> b_vec,
            tapa::mmap<float> c_vec, uint64_t n) {
  tapa::streams<float, p * p> a("a"), b("b"), c("c");
  // ... inter-PE streams ...

  tapa::task()
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), a_vec, a)
      .invoke(Scatter, tapa::executable(FLAGS_scatter_bitstream), b_vec, b)
      .invoke(ProcElem, tapa::executable(FLAGS_proc_elem_bitstream), a, b, c, ...)
      // ... more ProcElem instances ...
      .invoke(Gather, tapa::executable(FLAGS_gather_bitstream), c_vec, c);
}

When a FLAGS_*_bitstream flag is empty, that invocation falls back to software simulation automatically. This lets you bring up one kernel at a time.

Step 2: Compile each kernel separately

Each kernel function is compiled independently with its own tapa compile --top invocation:

tapa compile --top Scatter  --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
  -f cannon.cpp -o scatter.xo

tapa compile --top ProcElem --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
  -f cannon.cpp -o proc-elem.xo

tapa compile --top Gather   --part-num xcu250-figd2104-2L-e --clock-period 3.33 \
  -f cannon.cpp -o gather.xo

The three compilations read the same source file but each targets a different top function. The outputs are independent .xo files with no knowledge of each other.

Step 3: Run parallel emulation

Pass all three .xo files to the host binary. All cosim processes start concurrently:

./cannon-host \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo

Preventing work-directory collisions

By default each cosim process uses a temporary directory that is deleted at exit. When multiple processes share an explicit -cosim_work_dir, their intermediate files collide. Use -cosim_work_dir_parallel to give each process a unique subdirectory:

./cannon-host \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo \
    -cosim_work_dir ./cosim_work \
    -cosim_work_dir_parallel

TAPA creates ./cosim_work/XXXXXX/ per instance so the simulations do not interfere.

Limiting concurrency

On memory-constrained machines, set TAPA_CONCURRENCY to cap the number of running cosim processes:

TAPA_CONCURRENCY=1 ./cannon-host \
    --scatter_bitstream=scatter.xo \
    --proc_elem_bitstream=proc-elem.xo \
    --gather_bitstream=gather.xo

Even with TAPA_CONCURRENCY=1 the processes exchange data correctly through shared-memory FIFOs; they just run one at a time.

Step 4: Verify

A successful run prints the application's correctness result (e.g., PASS!) after all simulation processes finish. Diagnose failures the same way as single-kernel cosim: add -cosim_work_dir and -xsim_save_waveform to inspect per-kernel waveforms.

Lab 6: Floorplan & DSE

Goal: Use TAPA's floorplan design space exploration (DSE) to achieve timing closure on multi-SLR FPGAs.

Prerequisites: Lab 2: High-Bandwidth Memory and familiarity with synthesis flags from Performance Tuning.

After this lab you will understand how to apply a floorplan solution to a compile step and, if the RapidStream optimization tool is available, how to generate floorplan solutions automatically.

Overview

Multi-SLR FPGAs (U250, U280, U55C, and similar) partition logic across physically separate silicon dies connected by SLR crossings. Long wires that cross SLR boundaries are a common source of timing failures. TAPA's floorplan tooling addresses this by:

Assigning tasks to specific SLR regions.
Automatically inserting pipeline registers on streams that cross SLR boundaries.
Running a design space exploration to find placement configurations that stay within per-SLR resource limits.

Tool dependency

The floorplan generation step — which searches for optimal task-to-SLR assignments — requires rapidstream-tapaopt, an optimization tool historically provided by RapidStream Design Automation. This tool is no longer publicly accessible. If you hold a license, the full two-workflow process described below applies. If you do not, you can still apply a hand-written or externally provided floorplan.json directly using Workflow A Step 2, skipping the generation step.

Note

Compiling a design with a floorplan applied — inserting pipeline registers and reorganizing the task hierarchy — works without rapidstream-tapaopt. Only the automated search for floorplan solutions requires the external tool.

Workflow A: Manual floorplan

Use this workflow when you want to inspect individual floorplan solutions before committing to a full compile, or when you already have a floorplan.json from another source.

Step 1: Generate floorplan solutions (requires `rapidstream-tapaopt`)

tapa generate-floorplan \
  -f kernel.cpp \
  -t kernel0 \
  --device-config device_config.json \
  --floorplan-config floorplan_config.json \
  --clock-period 3.00 \
  --part-num xcu55c-fsvh2892-2L-e

This runs the DSE and writes one or more floorplan_N.json files to the working directory. Each file represents a distinct placement solution.

Step 2: Compile with a chosen solution

tapa compile \
  -f kernel.cpp \
  -t kernel0 \
  --floorplan-path floorplan_0.json \
  --clock-period 3.00 \
  --part-num xcu55c-fsvh2892-2L-e \
  --flatten-hierarchy

Warning

--floorplan-path requires --flatten-hierarchy. Omitting --flatten-hierarchy will cause the compile to fail.

TAPA reorganizes the task hierarchy according to the chosen floorplan and inserts pipeline registers at all SLR-crossing streams. This step does not require rapidstream-tapaopt.

Workflow B: Automated DSE (requires `rapidstream-tapaopt`)

Use this workflow to generate and compile all floorplan solutions in one step without manual inspection between them.

tapa compile-with-floorplan-dse \
  -f kernel.cpp \
  -t kernel0 \
  --device-config device_config.json \
  --floorplan-config floorplan_config.json \
  --clock-period 3.00 \
  --part-num xcu55c-fsvh2892-2L-e

compile-with-floorplan-dse runs the DSE, then compiles and applies pipeline insertion for each floorplan solution it generates. Use this when you want to produce all candidates in one automated run and pick the best result based on downstream timing reports.

Floorplan config format

The --floorplan-config JSON controls how the DSE searches for placement solutions. A representative example:

{
  "max_seconds": 1000,
  "dse_range_min": 0.7,
  "dse_range_max": 0.88,
  "partition_strategy": "flat",
  "cpp_arg_pre_assignments": {
    "a": "SLOT_X1Y0:SLOT_X1Y0",
    "b_0": "SLOT_X2Y0:SLOT_X2Y0"
  },
  "sys_port_pre_assignments": {
    "ap_clk": "SLOT_X2Y0:SLOT_X2Y0"
  }
}

Key fields:

dse_range_min / dse_range_max — The acceptable per-SLR resource utilization range (as a fraction of 1.0). The DSE only keeps placements where every SLR falls within this band.
cpp_arg_pre_assignments — Forces specific top-function kernel arguments to specific SLR slots. Values are SLOT_XmYn:SLOT_XmYn strings. Array arguments can be matched with regex patterns (for example "c_.*" matches c_0, c_1, etc.).
sys_port_pre_assignments — Forces Verilog system ports (clock, reset, AXI control) to specific slots. Regex patterns are supported here as well.

The full set of available fields (including grouping_constraints, slot_to_rtype_to_min_limit, and others) is documented in the RapidStream floorplan configuration reference.
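To make the regex keys concrete, here is a small plain C++ sketch of how pattern keys such as "c_.*" could map argument names to slots. It is illustrative only: the helper name ResolveSlot, the full-match semantics, and the first-match-wins ordering are all assumptions, not documented behavior of the floorplan tool.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Illustrative model of regex keys in cpp_arg_pre_assignments.
// Each rule is a (pattern, slot) pair, checked in order.
using PreAssignments = std::vector<std::pair<std::string, std::string>>;

// Returns the slot for `arg`, or an empty string if no pattern matches.
// Assumption: a pattern must match the whole argument name.
std::string ResolveSlot(const PreAssignments& rules, const std::string& arg) {
  for (const auto& rule : rules) {
    if (std::regex_match(arg, std::regex(rule.first))) return rule.second;
  }
  return "";
}
```

With the rules {"a" → "SLOT_X1Y0:SLOT_X1Y0", "c_.*" → "SLOT_X2Y0:SLOT_X2Y0"}, the names c_0 and c_1 both resolve to the second slot, while b_0 matches nothing and is left for the DSE to place.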

Examples Catalog

The TAPA repository includes two sets of example designs. Small self-contained examples live under tests/apps/. Larger benchmarks live under tests/regression/.

Small examples

Example	Problem type	Key TAPA feature	Location
vadd	Vector addition	Basic streams + mmap	`tests/apps/vadd`
bandwidth	Memory bandwidth benchmark	`async_mmap`, 32 HBM channels	`tests/apps/bandwidth`
network	Packet switching	`peek`, detached tasks, hierarchical tasks	`tests/apps/network`
cannon	Cannon's matrix multiply	2D stream arrays, systolic	`tests/apps/cannon`
jacobi	Stencil computation	End-of-transmission (`close()`)	`tests/apps/jacobi`

Published benchmarks

Example	Problem type	Key feature	Published in
autosa mm/10x13	Matrix multiplication	AutoSA-generated systolic (90% U55C LUT)	—
callipepla	Conjugate gradient	26 HBM channels	FPGA'23
cnn	CNN systolic array	Multi-SLR	FPGA'21
lu_decompose	LU systolic array	Multi-SLR	FPGA'21
hbm-bandwidth	HBM bandwidth profiler	`async_mmap`, all 32 channels	—
hbm-bandwidth-1-ch	HBM bandwidth (1 channel)	Minimal `async_mmap`	—
serpens	Sparse SpMV	Multiple HBM channels, scalable parallelism	DAC'22
spmm	Sparse SpMM	HBM streams	FPGA'22
spmv-hisparse-mmap	Sparse SpMV (HiSparse)	mmap-based SpMV	FPGA'22
knn	K-nearest-neighbor	FPT accelerator	FPT'20
page_rank	Page Rank	FCCM accelerator	FCCM'21

Note

The tests/regression/ directory is under active development; new designs are added regularly. Check the repository for the latest list.

Next step: Common Errors

Common Errors

Symptom descriptions and fixes for the most common compile-time and runtime errors.

When to use this page: When tapa g++ or tapa compile reports an error, or when software simulation crashes or produces wrong output.

Stream passed by value

Symptom: Compile error mentioning a deleted copy constructor, or that istream/ostream is not CopyConstructible.

Cause: The stream parameter is declared without &. Streams are non-copyable objects — they represent live communication channels between tasks, not data values.

Fix: Always pass streams by reference.

// Wrong
void Task(tapa::istream<int> in, tapa::ostream<int> out) { ... }

// Right
void Task(tapa::istream<int>& in, tapa::ostream<int>& out) { ... }

`mmap` passed by reference

Symptom: Compile error about a type mismatch or an unexpected & on an mmap parameter.

Cause: tapa::mmap<T> is essentially a pointer to a memory region and must be passed by value, not by reference.

Fix: Remove the & from mmap parameters.

// Wrong
void Task(tapa::mmap<int>& mem) { ... }

// Right
void Task(tapa::mmap<int> mem) { ... }

`async_mmap` passed by value

Symptom: A deprecation warning or compile error on an async_mmap parameter that is declared without &. Whether it is a warning or an error depends on the TAPA version.

Cause: tapa::async_mmap<T> is a set of streams that controls memory access. Like regular streams, it must be passed by reference.

Fix: Always pass async_mmap by reference.

// Wrong
void Task(tapa::async_mmap<int> mem) { ... }

// Right
void Task(tapa::async_mmap<int>& mem) { ... }

Computation in upper-level task body

Symptom: tapacc reports an error about computation in an upper-level task, or the design fails synthesis unexpectedly.

Cause: Upper-level tasks (tasks that invoke other tasks) may only contain stream declarations and .invoke() chains. Any arithmetic, conditionals, or other function calls belong in leaf tasks. For example, computing n * 2 directly in TopLevel is not allowed:

// Wrong
void TopLevel(int n, tapa::mmap<int> mem) {
  tapa::stream<int> s("s");
  tapa::task()
    .invoke(Task1, s, mem, n * 2)
    .invoke(Task2, s, n * 2);
}

Fix: Move the computation into the child task that uses the result.

// Right
void Task2(tapa::istream<int>& in, int n) {
  n = n * 2;
  // use n ...
}

void TopLevel(int n, tapa::mmap<int> mem) {
  tapa::stream<int> s("s");
  tapa::task()
    .invoke(Task1, s, mem, n)
    .invoke(Task2, s, n);
}

Stream array declared as `stream[]` instead of `streams<>`

Symptom: Compile error or incorrect behavior when defining or passing arrays of streams.

Cause: tapa::stream<T> arr[N] is not copyable or movable in the way TAPA expects. Arrays of streams must use the dedicated tapa::streams<T, N> type.

Fix: Use tapa::streams<T, N> for stream arrays, and use .invoke with a count to distribute elements rather than indexing manually.

// Wrong
tapa::stream<int> data_q[4];
tapa::task().invoke(Task, data_q[0], mem[0])
            .invoke(Task, data_q[1], mem[1]);

// Right
tapa::streams<int, 4> data_q;
tapa::mmaps<int, 4> mem;
tapa::task().invoke<tapa::join, 4>(Task, data_q, mem);

`tapac` not found

Symptom: Shell reports command not found: tapac.

Cause: tapac was the old command name. It has been replaced by tapa compile.

Fix: Replace tapac with tapa compile. Most flags carry over directly.

# Old
tapac --top VecAdd -f vadd.cpp -o vadd.xo ...

# New
tapa compile --top VecAdd -f vadd.cpp -o vadd.xo ...

Run tapa compile --help for the full option list.

Tasks not defined in the same compilation unit as the top-level function

Symptom: tapacc cannot find a task function, or a link error occurs for a task symbol.

Cause: TAPA requires all task functions to be visible in the same compilation unit as the top-level function. Placing tasks in separate .cpp files means the compiler never sees them together.

Fix: Define tasks in header files and #include them in the main kernel file.

// task1.hpp
void Task1(/* ... */) { /* ... */ }

// task2.hpp
void Task2(/* ... */) { /* ... */ }

// top_level.cpp
#include "task1.hpp"
#include "task2.hpp"

void TopLevel(/* ... */) {
  tapa::task().invoke(Task1, /* ... */).invoke(Task2, /* ... */);
}

Static variables behave differently in simulation vs hardware

Symptom: Software simulation produces different output than hardware execution.

Cause: Static variables are shared across all invocations within a single simulation process. In hardware, each task instance synthesizes its own independent copy of the variable.

For example:

void Task() {
  static int counter = 0;
  counter++;
}

tapa::task().invoke(Task).invoke(Task);

In software simulation counter reaches 2 (one shared variable, incremented twice). In hardware each instance has its own counter, so both instances end at 1.

Fix: Avoid static variables inside tasks. Pass state between tasks using stream or mmap arguments.
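The divergence described above can be reproduced in plain C++ without TAPA. In this sketch the names SharedCounterTask and TaskInstance are hypothetical. The first models the software-simulation view (one static shared by all invocations), and the second models the hardware view (one private copy of the state per instance).

```cpp
#include <cassert>

// Plain C++ model, not TAPA code.

// Software-simulation view: a single static shared by every invocation.
int SharedCounterTask() {
  static int counter = 0;
  return ++counter;
}

// Hardware view: each task instance owns a private copy of the state.
struct TaskInstance {
  int counter = 0;
  int Run() { return ++counter; }
};
```

Two calls to SharedCounterTask observe 1 and then 2, whereas two separate TaskInstance objects each observe 1. Keeping state in explicit arguments removes the ambiguity.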

Tip

If a parameter type mismatch error is confusing, work through this checklist:

Does the number of arguments at the call site match the task signature?
Are stream directions correct — istream for reads, ostream for writes?
Are passing conventions correct — streams and async_mmap by reference, mmap by value?
Is the parameter order the same between the call site and the task definition?

See also: Deadlocks & Hangs | Cosimulation Issues

Deadlocks & Hangs

When to use this page: When software simulation or fast cosim hangs without producing output, or terminates without printing results.

Note

tapa::stream enforces the declared depth in both software simulation and fast cosim/RTL. A blocking write() on a full stream yields the current coroutine and retries until space is available — so shallow stream depth can deadlock in software simulation too. The exception is tapa::hls::stream (the Vitis HLS compatibility alias), which uses effectively infinite depth in software simulation.

Diagnosis checklist

Work through the following causes in order — they are listed from most to least common.

1. Stream depth too shallow

A producer fills the FIFO and blocks waiting for the consumer to drain it. If the consumer is itself waiting for data from another stream, neither task can make progress and the simulation hangs.

Fix: Increase the stream depth by providing the second template argument.

// Default depth of 2 — may deadlock under backpressure
tapa::stream<int> s("s");

// Larger depth gives the producer room to run ahead
tapa::stream<int, 32> s("s");

Start at the default depth of 2 and increase to 16 or 32 when you observe backpressure. In hardware, deeper FIFOs consume more BRAM, so avoid over-provisioning depth once correctness is confirmed.
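One way to reason about the depth a stream needs is to count how many elements are in flight at the worst moment. The sketch below is a plain C++ model, not TAPA code. The helper RequiredDepth and the 'W'/'R' schedule encoding are hypothetical, and the model assumes the order of writes and reads is fixed in advance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Plain C++ model, not TAPA code. `schedule` lists the order in which a
// producer writes ('W') and a consumer reads ('R') a single FIFO.
// Returns the peak number of elements in flight, which is the smallest
// depth at which no write in this order finds the FIFO full.
std::size_t RequiredDepth(const std::string& schedule) {
  std::size_t in_flight = 0;
  std::size_t peak = 0;
  for (char op : schedule) {
    if (op == 'W') {
      ++in_flight;
      if (in_flight > peak) peak = in_flight;
    } else if (op == 'R' && in_flight > 0) {
      --in_flight;
    }
  }
  return peak;
}
```

An alternating schedule such as "WRWRWR" needs depth 1, while "WWWWRRRR" (a consumer that cannot start until four elements have been written) needs depth 4. A stream of depth 2 would block on the third write in that second order.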

2. Missing loop termination or element count mismatch

A writer sends fewer elements than the reader expects. The reader blocks indefinitely waiting for data that never arrives.

Fix: Verify that every producer sends exactly as many elements as the corresponding consumer reads. A common mistake is an off-by-one in loop bounds or a conditional write that skips elements.
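As a hypothetical illustration of how counts drift apart, consider a producer and a consumer that each derive a trip count from the same size but round differently. The helper names below are invented for this sketch and do not come from the TAPA examples.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers showing how two trip-count formulas can disagree.

// Truncating division: drops a partial trailing vector.
std::size_t WordsTruncated(std::size_t size, std::size_t words_per_vec) {
  return size / words_per_vec;
}

// Rounding up: counts a partial trailing vector as one more transfer.
std::size_t WordsRoundedUp(std::size_t size, std::size_t words_per_vec) {
  return (size + words_per_vec - 1) / words_per_vec;
}
```

For size 96 and 16 words per vector both formulas give 6, so the mismatch stays hidden. For size 100 they give 6 and 7, and a reader using the larger count waits forever for the seventh element.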

3. Circular dependency between tasks

Task A waits for output from Task B before it can write to Task B's input. Task B waits for input from Task A before it can produce output. Neither can make progress.

Fix: Redesign the data flow to eliminate the cycle. If a feedback path is genuinely required, use try_read / try_write so that a task can make progress even when the channel is empty or full.

4. `async_mmap` write responses not drained

The write_resp FIFO fills up. Once full, the hardware stops accepting new write addresses and the kernel stalls.

Fix: Always drain write_resp inside the same pipelined loop that issues writes. Use non-blocking try_write / try_read so both issue and drain progress every cycle:

void WriteTask(tapa::async_mmap<int>& mem, tapa::istream<int>& data, int n) {
  for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
    if (i_req < n && !data.empty() &&
        !mem.write_addr.full() && !mem.write_data.full()) {
      mem.write_addr.try_write(i_req);
      mem.write_data.try_write(data.read(nullptr));
      ++i_req;
    }
    uint8_t ack;
    if (mem.write_resp.try_read(ack)) {
      i_resp += unsigned(ack) + 1;  // ack encodes burst_length - 1
    }
  }
}

Splitting writes and response drain into separate loops risks deadlock: if write_resp fills before all writes are issued, the hardware stops accepting write addresses and the first loop never completes.
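The loop above terminates on the number of acknowledged writes, not on the number of responses, because one response can cover a whole burst. The plain C++ sketch below models only that accounting. The function name AcknowledgedWrites is hypothetical, and the 1 to 256 range follows from the ack being a uint8_t that encodes burst_length - 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Plain C++ model of the response accounting, not TAPA code.
// Each ack value encodes (burst_length - 1), so a single response
// acknowledges between 1 and 256 writes.
int64_t AcknowledgedWrites(const std::vector<uint8_t>& acks) {
  int64_t total = 0;
  for (uint8_t ack : acks) total += static_cast<int64_t>(ack) + 1;
  return total;
}
```

Three responses with ack values 0, 3, and 15 acknowledge 1 + 4 + 16 = 21 writes, so a loop that counted responses instead would keep waiting long after all writes had completed.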

Isolation strategy

Run with TAPA_CONCURRENCY=1 to run all tasks as coroutines on a single thread. This makes a hang deterministic, so it is easier to reproduce and to inspect with a debugger.

TAPA_CONCURRENCY=1 ./vadd

If the hang disappears at concurrency 1 but reappears at the default concurrency, the issue is a scheduling race rather than a structural deadlock. Look for assumptions about task ordering that do not hold under concurrent scheduling.

Finding the blocked task

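Run the binary under GDB to identify which task is stuck and on which operation. To inspect a process that is already hung, attach with gdb -p <pid> instead.

gdb ./vadd

Start the program, let it run until it hangs, then interrupt it with Ctrl-C:

(gdb) run
^C
(gdb) info threads
(gdb) thread apply all bt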

The backtrace shows the call stack of every thread. Look for a frame inside a read or write call on a TAPA stream: the stream name in that frame identifies where flow has stopped.

Waveform debugging in fast cosim

Run cosim with a persistent work directory and waveform capture enabled so you can inspect the simulation state after a hang.

./vadd --bitstream=vadd.xo \
  -cosim_work_dir ./cosim_work \
  -xsim_save_waveform \
  1000

If the simulation hangs, press Ctrl-C to terminate it, then open the waveform in Vivado:

vivado -mode gui -source ./cosim_work/output/run/run_cosim.tcl

Inspect the AXI and stream signals to identify which channel is stalled. A valid signal held high with a ready signal held low indicates backpressure; a ready signal high with no valid indicates the producer has stopped sending.
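The reading rule above amounts to a small decision table on the valid/ready pair. The sketch below encodes it in plain C++ for use on a trace exported from the waveform. The types and names (Sample, ChannelState, Classify, LongestBackpressureRun) are hypothetical and are not part of TAPA or Vivado.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ model for reading a handshake trace.
struct Sample {
  bool valid;
  bool ready;
};

enum class ChannelState { kTransfer, kBackpressure, kProducerIdle, kIdle };

// Classifies one cycle of a valid/ready handshake.
ChannelState Classify(const Sample& s) {
  if (s.valid && s.ready) return ChannelState::kTransfer;
  if (s.valid) return ChannelState::kBackpressure;  // consumer not accepting
  if (s.ready) return ChannelState::kProducerIdle;  // nothing being offered
  return ChannelState::kIdle;
}

// Longest run of consecutive backpressure cycles in the trace.
std::size_t LongestBackpressureRun(const std::vector<Sample>& trace) {
  std::size_t longest = 0;
  std::size_t current = 0;
  for (const Sample& s : trace) {
    current = (Classify(s) == ChannelState::kBackpressure) ? current + 1 : 0;
    if (current > longest) longest = current;
  }
  return longest;
}
```

A short backpressure run is normal pipeline behavior. A run that extends to the end of the trace is the signature of a stalled channel.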

Tip

Set TAPA_STREAM_LOG_DIR=/tmp/stream_logs before running. TAPA logs each value written to a stream into a file under that directory:

TAPA_STREAM_LOG_DIR=/tmp/stream_logs ./vadd

Each named stream gets its own log file. A stream with an empty or truncated log identifies where data flow stops.

Stream depth tuning reference

Symptom	Starting depth	Suggested increase
Hang with 2 tasks in a pipeline	2 (default)	16
Hang with deep pipeline (>4 stages)	16	32–64
Correctness issue, no hang	Any	Try 2 first to expose races

Increasing depth lets producers run further ahead of consumers and resolves backpressure-induced deadlocks. In hardware, each entry in a stream FIFO consumes flip-flops or BRAM. Once the design is functionally correct, profile resource usage and reduce depths where headroom allows.

See also: Common Errors | Cosimulation Issues

Cosimulation Issues

When to use this page: When fast cosim (--bitstream=vadd.xo) behaves differently from software simulation, or when cosim produces xsim or Verilator errors.

Fast cosim vs software simulation mismatches

If fast cosim fails (FAIL! or hangs) but software simulation passes, the most common causes are:

Non-deterministic task scheduling. Software simulation runs tasks cooperatively as coroutines, while RTL runs them truly in parallel. Races hidden by cooperative scheduling may surface as failures in fast cosim, and any result that depends on the relative timing of two tasks can differ between the two. Fix: remove assumptions about task ordering that are not enforced by stream synchronization.
Blocking async_mmap operations inside pipelined loops. A blocking call inside a pipelined loop can stall the pipeline in RTL in ways that software simulation does not model. Fix: use non-blocking reads/writes and manually handle the response FIFOs, or switch to tapa::mmap to simplify the memory access model while debugging.

Note

Fast cosim models DRAM with a simplified functional model. Throughput and latency numbers from fast cosim are not representative of on-board performance. Use fast cosim only to verify functional correctness.

HBM cross-channel access limitation

Warning

Fast cosimulation does not support cross-channel access for HBM. Each AXI interface can only access one HBM channel. Designs that require cross-channel HBM access must be validated on hardware rather than in fast cosim.

If your design uses multiple HBM pseudo-channels and the fast cosim result does not match software simulation, verify that no single AXI port accesses more than one HBM channel.

xsim issues

`xsim not found` or `Vivado not found`

xsim is part of the Vivado installation. Source the Vivado environment script before running cosim:

source /opt/Xilinx/Vivado/2022.1/settings64.sh
./vadd --bitstream=vadd.xo ...

Adjust the path to match your Vivado installation and version.

`xsim hangs at elaboration`

Check that the .xo file was produced by a successful tapa compile run. A partial or corrupt .xo (from a failed or interrupted compilation) can cause elaboration to hang silently. Re-run tapa compile from scratch and verify it exits with status 0 before running cosim.

Segfault inside xsim

This is typically a Vivado bug. Try switching to a different Vitis/Vivado version. Versions tested by the TAPA CI pipeline are listed in the TAPA repository's CI configuration.

Verilator issues

`verilator not found`

Install Verilator from your package manager or build from source:

# Debian/Ubuntu
sudo apt install verilator

Verilator compilation error (Verilog parsing error)

TAPA generates Verilog targeting recent Verilator versions. If you see Verilog parsing errors, update Verilator to the version used in TAPA's CI pipeline.

No waveform support with Verilator

Verilator simulation does not support waveform capture via the Vivado GUI. If you need waveform debugging, use xsim and pass -xsim_save_waveform as described below.

Cosim produces wrong output (`FAIL!`) but xsim does not hang

Run with waveform capture and a persistent work directory so you can inspect the simulation after it completes:

./vadd --bitstream=vadd.xo \
  -cosim_work_dir ./cosim_work \
  -xsim_save_waveform \
  1000

Then open the waveform in Vivado GUI:

vivado -mode gui -source ./cosim_work/output/run/run_cosim.tcl

In the waveform viewer, add the AXI memory interface signals and compare the expected vs actual data on each transaction. Look for read data that does not match what the host wrote, or write transactions that target unexpected addresses.

Stream diagnostics

The DPI runtime reports stream progress periodically when a stream stalls (empty on read or full on write). These messages appear on stderr and include the port name and queue state:

frt-dpi: progress[a_fifo_s]: read_ok=16 read_empty=40M write_ok=0 write_full=0 q_head=8 q_tail=8

Field	Meaning
`progress[port]`	The port that triggered the report (the one currently stalling).
`read_ok`	Total successful reads across all ports in this process.
`read_empty`	Total empty-read attempts (queue had no data).
`write_ok`	Total successful writes across all ports.
`write_full`	Total full-write attempts (queue had no space).
`q_head` / `q_tail`	Shared-memory queue counters for the stalling port. `q_tail` = elements pushed by the producer; `q_head` = elements popped by the consumer. `q_head == q_tail` means the queue is empty.

Enabling verbose per-element logging

Set the FRT_STREAM_DEBUG environment variable to log every successful stream read and write:

FRT_STREAM_DEBUG=1 ./vadd --bitstream=vadd.xo 1000

Interpreting stall patterns

q_tail=0 on a consumer port: the producer never wrote to this stream. Check that the producer's xsim started and that stream arguments are bound correctly.
q_head == q_tail but read_ok < expected: all produced elements were consumed but not enough were produced. The producer may have exited before flushing all writes.
write_full growing: the consumer is not draining fast enough. Check for deadlocks or increase TAPA_CONCURRENCY.
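
The three patterns above can be expressed as a small classifier. This is a sketch only: StreamProgress, StallPattern, and ClassifyStall are hypothetical names, not part of the FRT runtime, and the counters are assumed to have been copied by hand from two successive progress lines for the same port.

```cpp
#include <cstdint>

// Hypothetical snapshot of the counters printed in one progress line.
struct StreamProgress {
  std::uint64_t read_ok;
  std::uint64_t write_full;
  std::uint64_t q_head;  // elements popped by the consumer
  std::uint64_t q_tail;  // elements pushed by the producer
};

enum class StallPattern {
  kProducerNeverWrote,
  kProducerStoppedEarly,
  kConsumerNotDraining,
  kUnknown,
};

// Compares an earlier and a later snapshot of the same port against the
// number of reads the design is expected to perform.
StallPattern ClassifyStall(const StreamProgress& earlier,
                           const StreamProgress& later,
                           std::uint64_t expected_reads) {
  // q_tail == 0: nothing was ever pushed into the queue.
  if (later.q_tail == 0) return StallPattern::kProducerNeverWrote;
  // Queue drained, but fewer elements arrived than the consumer needs.
  if (later.q_head == later.q_tail && later.read_ok < expected_reads) {
    return StallPattern::kProducerStoppedEarly;
  }
  // write_full keeps growing between reports: the writer is being held back.
  if (later.write_full > earlier.write_full) {
    return StallPattern::kConsumerNotDraining;
  }
  return StallPattern::kUnknown;
}
```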

Tip

Always pass software simulation before running fast cosim. Software simulation runs faster and catches logic bugs in C++. Fast cosim catches RTL bugs introduced by synthesis and scheduling. Skipping software simulation wastes cosim time on bugs that are much faster to fix at the C++ level.

See also: Common Errors | Deadlocks & Hangs

CLI Commands

Reference for all tapa CLI subcommands. For task-oriented guides, see Build and Run and the other How-To pages. The general invocation form is:

tapa [global options] <subcommand> [subcommand options]

Note

tapa compile is a shortcut that runs tapa analyze, tapa synth, and tapa pack in sequence. When using the individual subcommands, pass --work-dir as a global flag before the subcommand name: tapa --work-dir DIR <subcommand>.

Global Options

These options must appear before the subcommand name.

Flag	Description
`--work-dir DIR` / `-w DIR`	Working directory for intermediate artifacts (default: `./work.out/`).
`--verbose` / `-v`	Increase logging verbosity. Repeatable (e.g., `-vv`).
`--quiet` / `-q`	Decrease logging verbosity.
`--remote-host user@host[:port]`	Remote Linux host where vendor tools run.
`--remote-key-file PATH`	SSH private key file for authenticating to the remote host.
`--remote-xilinx-settings PATH`	Path to `settings64.sh` on the remote host.
`--remote-ssh-control-dir DIR`	Local directory for SSH multiplex control sockets.
`--remote-ssh-control-persist DURATION`	How long the SSH master socket stays alive (default: `30m`).
`--remote-disable-ssh-mux`	Disable SSH connection multiplexing.

tapa compile

Run the full compilation pipeline (analyze → synth → pack) in a single command.

Required flags

Flag	Description
`--top FUNCTION` / `-t FUNCTION`	Top-level task function name.
`-f FILE`	Kernel source file.
`-o OUTPUT.xo`	Output XO file path.

Optional flags

Flag	Description
`--part-num PART`	Target FPGA part number (e.g., `xcu250-figd2104-2L-e`).
`--platform PLATFORM`	Vitis platform string. Alternative to `--part-num`.
`--clock-period NS`	Target clock period in nanoseconds.
`--target {xilinx-vitis,xilinx-hls,xilinx-aie}`	Output target (default: `xilinx-vitis`). `xilinx-aie` is experimental.
`-j N`	Number of parallel HLS jobs.
`--custom-rtl PATH`	Custom RTL file or directory to include in the XO.

Example

tapa compile \
  --top VecAdd \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -f vadd.cpp \
  -o vadd.xo
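
Because --clock-period takes nanoseconds while timing results are usually quoted in MHz, it can help to convert between the two with f = 1000 / T. The helper below is a sketch with a hypothetical name and is not part of TAPA.

```cpp
#include <cassert>

// Converts a clock period in nanoseconds to a frequency in MHz.
double PeriodNsToMHz(double period_ns) {
  assert(period_ns > 0.0);  // a zero or negative period has no frequency
  return 1000.0 / period_ns;
}
```

With this conversion, the 3.33 ns period used in the example corresponds to roughly 300 MHz, and a 4 ns period to 250 MHz.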

tapa analyze

Parse C++ source and extract the task graph to a JSON file in the work directory. This stage always runs locally and does not require vendor tools.

Required flags

Flag	Description
`--top FUNCTION` / `-t FUNCTION`	Top-level task function name.
`-f FILE`	Kernel source file.

Optional flags

Flag	Description
`--target {xilinx-vitis,xilinx-hls,xilinx-aie}`	Output target (default: `xilinx-vitis`). Controls the synthesis flow. `xilinx-aie` is experimental.

Example

tapa --work-dir work.out analyze --top VecAdd -f vadd.cpp

tapa synth

Run Vitis HLS on each task to produce per-task Verilog RTL. Reads the task graph produced by tapa analyze from the work directory. Can run on a remote host via --remote-host.

Required flags

Flag	Description
`--part-num PART`	Target FPGA part number. Required if `--platform` is not set.
`--platform PLATFORM`	Vitis platform string. Required if `--part-num` is not set.

Optional flags

Flag	Description
`--clock-period NS`	Target clock period in nanoseconds. Can be derived from `--platform` if not set explicitly.
`-j N`	Number of parallel HLS jobs (default: number of physical CPU cores).
`--enable-synth-util`	Run post-HLS RTL synthesis to produce per-task resource utilization estimates.
`--nonpipeline-fifos JSON`	JSON specification of FIFOs for which pipeline registers should be suppressed.
`--gen-ab-graph`	Generate `ab_graph.json` for AutoBridge/RapidStream floorplanning.
`--gen-graphir`	Generate `graphir.json` for RapidStream.
`--floorplan-config PATH`	Path to the floorplan configuration file. Used with `--gen-ab-graph` or `--gen-graphir`.
`--device-config PATH`	Path to the device configuration file. Used with `--gen-graphir`.
`--floorplan-path PATH`	Path to an existing floorplan file to apply during synthesis. Requires `--flatten-hierarchy`.

Example

tapa --work-dir work.out synth \
  --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 \
  -j 4

tapa pack

Package per-task RTL from the work directory into a single output artifact. For the default xilinx-vitis target this produces an XO file; for other targets a ZIP file is produced. Reads RTL produced by tapa synth.

Optional flags

Flag	Description
`-o OUTPUT`	Output file path (default: `work.xo` for the Vitis target, `work.zip` for other targets).
`--custom-rtl PATH`	Custom RTL file or directory to include in the XO.

Example

tapa --work-dir work.out pack -o vadd.xo

tapa g++

Compile TAPA host and kernel C++ for software simulation. This is a wrapper around g++ that automatically sets the required TAPA include paths and link flags. All arguments after -- are forwarded directly to g++.

Example

tapa g++ -- vadd.cpp vadd-host.cpp -o vadd

See Software Simulation for how to run the resulting executable.

tapa version

Print the installed TAPA version.

tapa version

Runtime Flags

This page covers environment variables and host executable flags that control TAPA behavior at runtime. These apply after compilation, during software simulation or fast hardware cosimulation.

Environment Variables

These variables are read by the host executable at startup.

Variable	Default	Description
`TAPA_CONCURRENCY`	Number of CPU cores	Number of parallel coroutine threads used by software simulation. Set to `1` for single-threaded, more reproducible simulation runs. Has no effect on HLS compilation parallelism (`-j`).
`TAPA_STREAM_LOG_DIR`	(unset — logging disabled)	Directory for stream transfer logs. When set, TAPA writes one log file per named stream recording each value written to that stream. Useful for tracing data corruption during software simulation.
`FRT_STREAM_DEBUG`	(unset)	When set, log every successful stream read and write in the DPI layer. Produces high-volume output; use only for targeted debugging.
`FRT_COSIM_YIELD`	`1` (enabled)	When enabled, the DPI layer calls `thread::yield_now()` on empty reads or full writes. Disable with `0` to busy-wait instead.
`FRT_XSIM_LEGACY`	`0`	Set to `1` to use the legacy xelab command-line format for older Vivado versions.
`FRT_XOCL_BDF`	(unset)	PCIe Bus:Device:Function for XRT/OpenCL device selection. Equivalent to the `-xocl_bdf` gflag.
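
Several of these variables act as on/off switches with a default, for example FRT_COSIM_YIELD, which is enabled unless set to 0. If your own host code needs a switch that follows the same convention, one possible sketch is shown here. EnvFlagEnabled is a hypothetical helper; how TAPA and FRT parse their variables internally is not documented on this page and may differ.

```cpp
#include <cstdlib>
#include <cstring>

// Returns default_value when the variable is unset. Otherwise treats the
// exact string "0" as disabled and any other value as enabled.
// This convention is an assumption for illustration.
bool EnvFlagEnabled(const char* name, bool default_value) {
  const char* value = std::getenv(name);
  if (value == nullptr) return default_value;
  return std::strcmp(value, "0") != 0;
}
```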

Example: reproducible single-threaded simulation

TAPA_CONCURRENCY=1 ./vadd

Example: enable stream logging

TAPA_STREAM_LOG_DIR=/tmp/stream-logs ./vadd

See Software Simulation for more on stream logging and debugging.

Host Executable Flags (Fast Cosim)

When the host executable is invoked with --bitstream=vadd.xo, it runs fast hardware cosimulation instead of software simulation. The following flags control cosim behavior. They are passed directly on the host executable command line.

Note

These flags use a single-dash prefix (e.g., -cosim_work_dir) because they are parsed by the host executable via gflags.

Flag	Description
`-cosim_executable <path>`	Deprecated. Fast cosim now runs in-process via `libfrt`; this flag is ignored.
`-xsim_part_num <part>`	Target FPGA part number for simulation (e.g., `xcu280-fsvh2892-2L-e`).
`-cosim_work_dir <dir>`	Persistent working directory for simulation artifacts. Without this flag, a temporary directory is used and deleted after the run.
`-xsim_save_waveform`	Save simulation waveforms to a `.wdb` file in the work directory. Pair with `-cosim_work_dir`; without it, the temporary directory and all waveforms are deleted after the run.
`-xsim_start_gui`	Open the Vivado GUI for interactive debugging during simulation.
`-cosim_simulator <backend>`	Simulator backend: `xsim` (default, Linux only, requires Vivado) or `verilator` (cross-platform, no Vivado required).
`-cosim_setup_only`	Run simulation setup only, then stop before executing the simulation. Useful for inspecting generated simulation files before committing to a full run.
`-cosim_resume_from_post_sim`	Skip re-running the simulation and jump directly to post-simulation checks. Use after a completed simulation to re-run checks without re-simulating.
`-cosim_work_dir_parallel`	Create a unique subdirectory per instance when running multiple concurrent simulations, preventing work directory collisions.

Example: save waveforms from a named work directory

./vadd --bitstream=vadd.xo \
  -cosim_work_dir ./cosim_work \
  -xsim_save_waveform \
  1000

Example: staged workflow (setup then resume)

# Step 1: set up and inspect the simulation environment
./vadd --bitstream=vadd.xo -cosim_work_dir ./cosim_work -cosim_setup_only 1000

# Step 2: once the simulation in ./cosim_work has completed, re-run only the post-simulation checks
./vadd --bitstream=vadd.xo -cosim_work_dir ./cosim_work -cosim_resume_from_post_sim 1000

For a full walkthrough of fast cosim workflows, see Fast Hardware Simulation.

C++ API

This page documents the TAPA C++ library (#include <tapa.h>). Types and functions live in the tapa namespace unless noted otherwise.

Task Invocation

`tapa::task`

The task hierarchy builder. An upper-level task constructs a tapa::task and chains .invoke() calls on it. The tapa::task destructor waits for all joined child instances to finish before returning.

struct task {
  // Invoke func with the given arguments using the default join mode.
  template <typename Func, typename... Args>
  task& invoke(Func&& func, Args&&... args);

  // Invoke func with an explicit mode (tapa::join or tapa::detach).
  template <internal::InvokeMode mode, typename Func, typename... Args>
  task& invoke(Func&& func, Args&&... args);

  // Invoke func N times with the given mode.
  template <internal::InvokeMode mode, int N, typename Func, typename... Args>
  task& invoke(Func&& func, Args&&... args);
};

Invoke modes:

Mode	Behavior
`tapa::join` (default)	The task runs concurrently with siblings; the parent waits for it to finish before returning.
`tapa::detach`	Fire-and-forget; the parent does not wait for the task to finish. Use with care — the parent may return before the detached task completes.

Example:

void Top(tapa::istream<float>& in, tapa::ostream<float>& out, int n) {
  tapa::task()
      .invoke(LoadData, in, n)
      .invoke<tapa::detach>(MonitorTask, n)
      .invoke(StoreData, out, n);
}

`tapa::seq`

A sequential index generator. When tapa::seq{} is passed as an argument to .invoke() with a repeat count N, each invocation receives a unique integer (0, 1, 2, …, N−1). Use this to distribute indexed work across task instances, such as assigning each instance its slice of a stream array.

tapa::streams<float, 4> channels;
tapa::task().invoke<tapa::join, 4>(Worker, channels, tapa::seq{});
// Worker instance 0 gets channels[0], instance 1 gets channels[1], etc.
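
The indexing behavior can be modeled in plain C++ without TAPA. In this sketch, SeqModel, InvokeRepeated, and IndicesSeenByInstances are hypothetical stand-ins that only mimic the "one fresh index per instance" behavior described above. They say nothing about how tapa::seq is implemented, and the instances run one after another here rather than in parallel.

```cpp
#include <vector>

// Hypothetical stand-in for tapa::seq: hands out 0, 1, 2, ... on each use.
struct SeqModel {
  int next = 0;
  int operator()() { return next++; }
};

// Calls func N times, passing each call the next index from a fresh sequence.
template <int N, typename Func>
void InvokeRepeated(Func&& func) {
  SeqModel seq;
  for (int i = 0; i < N; ++i) func(seq());
}

// Records which index each of the N modeled instances received.
template <int N>
std::vector<int> IndicesSeenByInstances() {
  std::vector<int> seen;
  InvokeRepeated<N>([&seen](int idx) { seen.push_back(idx); });
  return seen;
}
```

Calling IndicesSeenByInstances<4>() yields the indices 0 through 3, one per modeled instance.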

`tapa::executable`

Wraps a path to an XO or bitstream file for use in .invoke(). When an executable is passed as the second argument to .invoke(), the task runs on hardware (via FRT) instead of in software simulation.

class executable {
 public:
  explicit executable(std::string path);
};

Usage:

tapa::task().invoke(MyKernel, tapa::executable("my_kernel.xo"), arg1, arg2);

Streams

Streams are the fundamental inter-task communication primitive. Each stream is a fixed-depth FIFO. Blocking operations stall until data or space is available; non-blocking operations return immediately.

`tapa::stream<T, Depth>`

Bidirectional FIFO that owns the underlying storage. Declared inside an upper-level task and passed to child tasks as istream<T>& (read end) or ostream<T>& (write end). The default depth is 2.

template <typename T, uint64_t Depth = 2>
class stream;

`tapa::istream<T>`

Read-only view of a stream. Always passed by reference in task signatures: tapa::istream<T>&.

Method	Blocking	Destructive	Description
`read()`	yes	yes	Blocks until an element is available, then returns it.
`read(bool& ok)`	no	yes	Non-blocking read; sets `ok` to true if an element was consumed.
`try_read(T& val)`	no	yes	Non-blocking read; returns true and writes to `val` if successful.
`peek(bool& ok)`	no	no	Returns the next element without consuming it; sets `ok`.
`try_peek(T& val)`	no	no	Non-blocking peek; returns true if data was available.
`empty()`	no	no	Returns true if the stream contains no elements.
`eot(bool& ok)`	no	no	Returns true if the head element is an end-of-transaction marker.
`open()`	yes	yes	Blocks until an EoT marker arrives, then consumes it. Used to receive stream closure.
`try_open()`	no	yes	Non-blocking variant of `open()`; returns true if EoT was consumed.

`tapa::ostream<T>`

Write-only view of a stream. Always passed by reference in task signatures: tapa::ostream<T>&.

Method	Blocking	Destructive	Description
`write(const T& val)`	yes	yes	Blocks until space is available, then writes `val`.
`try_write(const T& val)`	no	yes	Non-blocking write; returns true if the element was written.
`full()`	no	no	Returns true if the stream is full.
`close()`	yes	yes	Writes an end-of-transaction marker; blocks until space is available.
`try_close()`	no	yes	Non-blocking variant of `close()`; returns true if the EoT was written.
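
The non-blocking semantics in the two tables can be illustrated with a small software model. BoundedFifo is a hypothetical class written for this sketch, not the TAPA implementation. It mirrors only the try_read, try_peek, try_write, empty, and full behavior and omits blocking calls and EoT handling.

```cpp
#include <cstddef>
#include <deque>

// Minimal fixed-depth FIFO model with non-blocking operations only.
template <typename T, std::size_t Depth = 2>
class BoundedFifo {
 public:
  bool empty() const { return q_.empty(); }
  bool full() const { return q_.size() == Depth; }

  // Non-blocking write: fails (returns false) when the FIFO is full.
  bool try_write(const T& val) {
    if (full()) return false;
    q_.push_back(val);
    return true;
  }

  // Non-blocking, destructive read: fails when the FIFO is empty.
  bool try_read(T& val) {
    if (empty()) return false;
    val = q_.front();
    q_.pop_front();
    return true;
  }

  // Non-blocking, non-destructive peek at the head element.
  bool try_peek(T& val) const {
    if (empty()) return false;
    val = q_.front();
    return true;
  }

 private:
  std::deque<T> q_;
};
```

With the default depth of 2, a third try_write fails until a try_read frees a slot, which is the backpressure that a blocking write() would turn into a stall.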

`tapa::streams<T, N, Depth>`

Array of N streams of type T, each with depth Depth. Declared in an upper-level task and unpacked by index when passed to child tasks.

`tapa::istreams<T, N>` / `tapa::ostreams<T, N>`

Array of N read-only or write-only stream views. Always passed by reference in task signatures.

Note

All stream types (istream, ostream, istreams, ostreams) must be passed by reference in task signatures. Passing by value is a compile error.

Memory (mmap)

`tapa::mmap<T>`

A pointer-like handle for synchronous bulk memory access. Backed by a contiguous host allocation. In a task signature, tapa::mmap<T> is passed by value.

template <typename T>
class mmap {
 public:
  explicit mmap(T* ptr);
  mmap(T* ptr, uint64_t size);
  template <typename Container>
  explicit mmap(Container& container);  // accepts std::vector etc.

  T* data() const;
  uint64_t size() const;

  template <uint64_t N>
  mmap<vec_t<T, N>> vectorized() const;  // reinterpret as wider element type

  template <typename U>
  mmap<U> reinterpret() const;  // reinterpret element type
};

`tapa::async_mmap<T>`

Decoupled memory access type. Instead of blocking on each memory operation, the kernel issues read/write requests and collects responses through five FIFO channels. This allows the kernel to pipeline memory operations. Passed by reference in task signatures: tapa::async_mmap<T>&.

See async_mmap channels below for channel details.

`tapa::mmaps<T, N>`

Array of N tapa::mmap<T> regions. Passed by value as a single argument and unpacked by the framework one region per child invocation.

template <typename T, uint64_t N>
class mmaps;

Directional mmap wrappers (host-side only)

Used in the top-level tapa::invoke() call to express direction hints. The kernel task signature uses plain tapa::mmap<T> or tapa::mmaps<T, N>.

Wrapper	Direction
`tapa::read_only_mmap<T>`	Host writes, kernel reads
`tapa::write_only_mmap<T>`	Kernel writes, host reads
`tapa::read_write_mmap<T>`	Both read and write
`tapa::placeholder_mmap<T>`	No direction hint
`tapa::read_only_mmaps<T, N>`	Array variant of `read_only_mmap`
`tapa::write_only_mmaps<T, N>`	Array variant of `write_only_mmap`
`tapa::read_write_mmaps<T, N>`	Array variant of `read_write_mmap`

`tapa::aligned_allocator<T>`

STL-compatible allocator that returns page-aligned memory suitable for DMA transfers. Use this with std::vector when allocating host buffers that will be passed to a kernel.

std::vector<float, tapa::aligned_allocator<float>> buf(n);
tapa::invoke(MyKernel, bitstream, tapa::read_only_mmap<float>(buf), n);
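
For readers curious what a page-aligned STL allocator involves, here is a minimal sketch. PageAlignedAllocator, IsPageAligned, and the 4096-byte kPageSize are assumptions for illustration. This is not the source of tapa::aligned_allocator, and it relies on POSIX posix_memalign.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Minimal STL-compatible allocator that returns page-aligned storage.
template <typename T>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

  T* allocate(std::size_t n) {
    void* ptr = nullptr;
    if (posix_memalign(&ptr, kPageSize, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    return static_cast<T*>(ptr);
  }

  void deallocate(T* ptr, std::size_t) { std::free(ptr); }
};

// All instances are interchangeable because the allocator holds no state.
template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return true;
}
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) {
  return false;
}

// Returns true if ptr sits on a kPageSize boundary.
bool IsPageAligned(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) % kPageSize == 0;
}
```

A std::vector<float, PageAlignedAllocator<float>> then behaves like an ordinary vector whose data() pointer starts on a page boundary.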

async_mmap Channels

tapa::async_mmap<T> exposes five public member channels. The kernel writes addresses and data to the request channels and reads results from the response channels. Each channel is an ordinary istream or ostream, so its try_-prefixed methods are non-blocking while plain read() and write() block.

Channel	Type	Direction	Description
`read_addr`	`ostream<int64_t>`	kernel → memory	Write an element index to request a read. The framework converts the index to a byte offset internally.
`read_data`	`istream<T>`	memory → kernel	Read the data returned by a previously issued read request.
`write_addr`	`ostream<int64_t>`	kernel → memory	Write an element index to request a write.
`write_data`	`ostream<T>`	kernel → memory	Write the data to be written at the requested address.
`write_resp`	`istream<uint8_t>`	memory → kernel	Drain write-completion acknowledgements. Each response value encodes `burst_length - 1` (i.e., a value of 0 means one write completed, 255 means 256 writes completed).

Warning

The kernel must drain write_resp to avoid deadlock. If the response channel fills up, the memory subsystem stops issuing further write completions and the kernel stalls.
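
Since each write_resp value encodes burst_length - 1, a kernel that tracks outstanding writes adds the value plus one for every response it drains. The arithmetic can be sketched in plain C++ as follows; CompletedWrites is a hypothetical helper that takes the drained values as a vector instead of reading a real channel.

```cpp
#include <cstdint>
#include <vector>

// Sums the number of completed writes represented by a sequence of
// write_resp values, where each value encodes burst_length - 1.
std::uint64_t CompletedWrites(const std::vector<std::uint8_t>& acks) {
  std::uint64_t total = 0;
  for (std::uint8_t ack : acks) {
    total += static_cast<std::uint64_t>(ack) + 1;
  }
  return total;
}
```

For example, the responses 0, 255, and 3 together acknowledge 1 + 256 + 4 = 261 writes.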

Typical async_mmap read pattern:

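void Reader(tapa::async_mmap<float>& mem, tapa::ostream<float>& out, int n) {
  // i_req counts issued read requests; i_resp counts responses forwarded.
  for (int i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
    // Issue the next address only while requests remain and there is room.
    if (i_req < n && !mem.read_addr.full()) {
      mem.read_addr.write(i_req);
      ++i_req;
    }
    // Forward one response per iteration when data is available.
    float val;
    if (mem.read_data.try_read(val)) {
      out.write(val);
      ++i_resp;
    }
  }
}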

Utilities

`tapa::vec_t<T, N>`

An N-element SIMD vector of type T. Elements are stored as a packed bit array, which maps directly to wide AXI ports. Supports element access via operator[], element-wise arithmetic operators, and common reductions (sum, product).

template <typename T, int N>
struct vec_t {
  static constexpr int length = N;
  static constexpr int width = widthof<T>() * N;  // total bit width

  T& operator[](int pos);
  const T& operator[](int pos) const;
};

Related free functions: truncated<begin, end>(vec), cat(v1, v2), make_vec<N>(val).

`tapa::widthof<T>()`

Returns the bit width of type T. For ap_int<W> and ap_uint<W>, returns W. For plain C++ types, returns sizeof(T) * CHAR_BIT.

template <typename T>
inline constexpr int widthof();

template <typename T>
inline constexpr int widthof(T object);  // deduce T from argument
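
The plain-type rule and the vec_t width formula can be checked with a few lines of standalone C++. WidthOf and PackedWidth are hypothetical re-implementations for illustration that cover only plain C++ types, not ap_int or ap_uint. The 512-bit result assumes a 32-bit float.

```cpp
#include <climits>
#include <cstdint>

// Bit width of a plain C++ type, following the sizeof(T) * CHAR_BIT rule.
template <typename T>
constexpr int WidthOf() {
  return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Total bit width of N packed elements of type T (the vec_t width formula).
template <typename T, int N>
constexpr int PackedWidth() {
  return WidthOf<T>() * N;
}

// Holds on platforms with a 32-bit float (assumed here).
static_assert(PackedWidth<float, 16>() == 512,
              "16 packed floats fill a 512-bit word");
```

Under that assumption, sixteen floats occupy 512 bits, which is the kind of width calculation needed when sizing vectors for a wide AXI port.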

EoT macros

End-of-transaction macros simplify consuming a stream until a sentinel marker is received.

Macro	Description
`TAPA_WHILE_NOT_EOT(stream)`	Loop body executes once per data element; loop exits when the EoT marker is seen.
`TAPA_WHILE_NEITHER_EOT(s1, s2)`	Two-stream variant; exits when either stream reaches EoT.
`TAPA_WHILE_NONE_EOT(s1, s2, s3)`	Three-stream variant.

// Example: consume all elements from 'in' and forward to 'out'
TAPA_WHILE_NOT_EOT(in) {
  out.write(in.read());
}
in.open();   // consume the EoT marker
out.close(); // send EoT marker downstream
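
The same forward-until-EoT pattern can be modeled in plain C++ by tagging each queue entry with an end-of-transaction flag. Packet and ForwardUntilEot are hypothetical names for this sketch. It illustrates the control flow only and does not describe how TAPA represents the marker internally.

```cpp
#include <deque>
#include <vector>

// One queue entry: either a data value or an end-of-transaction marker.
struct Packet {
  int value;
  bool eot;
};

// Forwards data elements until an EoT marker is at the head of the queue,
// then consumes the marker (the role in.open() plays in the TAPA example).
std::vector<int> ForwardUntilEot(std::deque<Packet>& in) {
  std::vector<int> out;
  while (!in.empty() && !in.front().eot) {
    out.push_back(in.front().value);
    in.pop_front();
  }
  if (!in.empty()) in.pop_front();  // consume the EoT marker itself
  return out;
}
```

Elements queued after the marker are left in place, so they belong to the next transaction.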

Synthesis pragmas (C++ attributes)

These C++ attributes are recognised by TAPA and lowered to Vitis HLS pragmas during synthesis. They have no effect in software simulation.

Attribute	Description
`[[tapa::pipeline(II)]]`	Pipeline the enclosing loop or function with initiation interval `II`.
`[[tapa::unroll(factor)]]`	Unroll the enclosing loop by `factor`.
`[[tapa::target("ignore")]]`	Mark a task for custom RTL replacement. TAPA generates a port-signature template but does not synthesize the task body.

Note

[[tapa::target("ignore")]] was formerly written as [[tapa::target("non_synthesizable", "xilinx")]]. The "ignore" form is the current spelling.

`tapa::hls` sub-namespace

tapa::hls::stream<T> is a stream type that behaves like hls::stream<T> in software simulation: it has effectively infinite depth, so producers never block in simulation. Use it when incrementally migrating a Vitis HLS design and you want software simulation to pass without tuning stream depths. #include <tapa.h> includes this automatically.

Note

tapa::hls::stream synthesizes to the same RTL FIFO as tapa::stream<T, N> with the declared depth N; the infinite depth applies only to software simulation. The practical reason to replace it before a hardware build is that software simulation with tapa::hls::stream cannot expose backpressure bugs. Switching to tapa::stream<T, N> with a tuned depth (passed to child tasks as tapa::istream<T>& / tapa::ostream<T>&) catches those bugs at simulation time rather than on hardware.

Output Files

Output Artifacts

The artifact produced by tapa depends on the target selected with --target.

Xilinx Vitis target (--target xilinx-vitis, the default)

Produces an .xo object file. This is passed to the Vitis v++ compiler for bitstream generation. An XO file is a ZIP archive; you can unzip it to inspect or manually edit the RTL it contains, then re-zip it before passing it to v++.

Xilinx HLS target (--target xilinx-hls)

Produces a .zip RTL archive instead of an .xo file. The archive contains the same RTL files and metadata but without the Vitis shell wrapper. Use this when the RTL is consumed directly by a downstream EDA tool.

Reproducibility

TAPA strips timestamps, absolute paths, and random IDs from both .xo and .zip artifacts before writing them to disk. Given the same source code and tool versions, repeated invocations produce byte-identical output. This makes the artifacts suitable for CI and release attestation workflows.

Note

Byte identity holds only within the same vendor tool version. Upgrading Vitis HLS or Vivado will typically change internal artifact content even for identical source inputs.
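
One way to check reproducibility in a CI job is to build twice and compare the two artifacts byte for byte. The sketch below uses only the C++ standard library. FilesIdentical is a hypothetical helper, and any file names you pass to it are your own.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Returns true only if both files can be opened and contain exactly the
// same bytes in the same order.
bool FilesIdentical(const std::string& path_a, const std::string& path_b) {
  std::ifstream a(path_a, std::ios::binary);
  std::ifstream b(path_b, std::ios::binary);
  if (!a || !b) return false;
  // The four-iterator overload also returns false when lengths differ.
  return std::equal(std::istreambuf_iterator<char>(a),
                    std::istreambuf_iterator<char>(),
                    std::istreambuf_iterator<char>(b),
                    std::istreambuf_iterator<char>());
}
```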

Intermediate Files

When --work-dir is specified (recommended), TAPA writes intermediate files to that directory. The structure is:

work.out/
├── cpp/
├── flatten/
├── log/
├── tar/
├── hdl/
├── graph.json
├── settings.json
├── report.json
└── report.yaml

File and directory descriptions

cpp/

Contains per-task C++ source files extracted by tapa analyze. Each file is independently compiled to RTL by vitis_hls.

flatten/

Created during tapa analyze. Contains preprocessed (flattened) copies of the input source files, one per input file, with a short hash prefix in the filename to avoid collisions. All #include directives are expanded and comments are preserved, giving tapacc self-contained translation units to operate on.

log/

Stores logs from processing steps, including vitis_hls csynth_design logs.

tar/

Contains one .tar archive per task. Each archive holds the output of csynth_design for that task.

hdl/

Stores RTL files for all tasks generated by vitis_hls, plus TAPA-specific infrastructure RTL.

graph.json

JSON file recording all contents and metadata of the input design, including the task graph structure.

settings.json

Records compilation settings shared across pipeline steps (target, part number, clock period, platform). Downstream tapa sub-commands read this file to avoid repeating options on the command line.

report.json / report.yaml

Post-synthesis resource utilisation report, written unconditionally after tapa synth completes. Both files contain the same data in JSON and YAML encoding respectively. Passing --enable-synth-util to tapa synth additionally generates per-task .hier.util.rpt files under tar/, but does not affect whether these top-level report files are written.

C++ Quick Reference

Common patterns for writing TAPA kernels. For full API details see C++ API.

Task structure

// Leaf tasks: contain all computation. Defined first so Top can name them.
void Load(tapa::mmap<const float> mem, uint64_t n, tapa::ostream<float>& q) {
  for (uint64_t i = 0; i < n; ++i) q.write(mem[i]);
}

void Store(tapa::istream<float>& q, tapa::mmap<float> mem, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) mem[i] = q.read();
}

// Upper-level task: declare streams, invoke leaf tasks. No computation.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n) {
  tapa::stream<float, 16> q("q");
  tapa::task()
      .invoke(Load, in, n, q)
      .invoke(Store, q, out, n);
}

Host code

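#include <cstdint>
#include <cstdlib>
#include <vector>

#include <gflags/gflags.h>
#include <tapa.h>

// Kernel entry point, defined in the kernel source file.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, uint64_t n);

DEFINE_string(bitstream, "", "XO or xclbin path. Empty = software simulation.");

int main(int argc, char* argv[]) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  // Element count from the first positional argument; 1024 is an arbitrary default.
  const uint64_t n = argc > 1 ? std::atoll(argv[1]) : 1024;

  std::vector<float, tapa::aligned_allocator<float>> a(n), b(n);

  tapa::invoke(Top, FLAGS_bitstream,
               tapa::read_only_mmap<const float>(a),
               tapa::write_only_mmap<float>(b),
               n);
  return 0;
}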

`FLAGS_bitstream` value	Backend
(empty)	Software simulation
`kernel.xo`	Fast cosimulation
`kernel.hw.xclbin`	On-board execution

Stream types

Type	Use in signature	Direction
`tapa::stream<T, Depth>`	local variable in upper task	owner
`tapa::istream<T>&`	leaf task parameter	read only
`tapa::ostream<T>&`	leaf task parameter	write only
`tapa::streams<T, N>`	local variable	array owner
`tapa::istreams<T, N>&`	leaf task parameter	array read
`tapa::ostreams<T, N>&`	leaf task parameter	array write

// Read
T val = in.read();               // blocking
bool got = in.try_read(val);     // non-blocking, returns true on success

// Write
out.write(val);                  // blocking
bool sent = out.try_write(val);  // non-blocking, returns true on success

// State checks
bool e = in.empty();
bool f = out.full();

// End-of-transaction
out.close();                   // send EoT marker
in.open();                     // consume EoT marker
TAPA_WHILE_NOT_EOT(in) { ... } // loop until EoT

Stream depth and FPGA resource:

Depth	Resource
< 128	SRL shift-register (no BRAM)
≥ 128	BRAM
≥ 4096 and element width ≥ 36 b	URAM
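
When sizing streams it can be handy to encode the table as a lookup. The function below is a sketch that copies the thresholds from the table verbatim. FifoResource and FifoResourceFor are hypothetical names, and the actual mapping is decided by the synthesis tools, so treat the result as an estimate.

```cpp
// Storage primitive a stream FIFO is expected to map to.
enum class FifoResource { kSrl, kBram, kUram };

// Applies the thresholds from the table: URAM for deep and wide FIFOs,
// BRAM from depth 128 upward, SRL shift registers otherwise.
FifoResource FifoResourceFor(unsigned depth, unsigned width_bits) {
  if (depth >= 4096 && width_bits >= 36) return FifoResource::kUram;
  if (depth >= 128) return FifoResource::kBram;
  return FifoResource::kSrl;
}
```

For instance, a depth-16 stream of 32-bit floats stays in SRLs, while raising it to depth 128 moves it into BRAM.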

Memory types

Type	Signature	Access style
`tapa::mmap<T>`	by value	synchronous, pointer-like
`tapa::async_mmap<T>`	by reference `&`	decoupled AXI channels

// mmap — simple loop
for (int i = 0; i < n; ++i) out[i] = in[i];

// async_mmap — overlapping reads (two-counter loop)
for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
  if (i_req < n && mem.read_addr.try_write(i_req)) ++i_req;  // advance only if the request was accepted
  T val;
  if (mem.read_data.try_read(val)) result[i_resp++] = val;
}

// async_mmap — writes with response drain
for (int64_t i_req = 0, i_resp = 0; i_resp < n;) {
#pragma HLS pipeline II=1
  if (i_req < n && !src.empty() &&
      !mem.write_addr.full() && !mem.write_data.full()) {
    mem.write_addr.try_write(i_req);
    mem.write_data.try_write(src.read(nullptr));
    ++i_req;
  }
  uint8_t ack;
  if (mem.write_resp.try_read(ack)) i_resp += unsigned(ack) + 1;
}

Parallel task instances

// Invoke N instances; each gets a unique index via tapa::seq
tapa::streams<float, 4> ch("ch");
tapa::task().invoke<tapa::join, 4>(Worker, ch, tapa::seq{});

void Worker(tapa::istream<float>& in, int idx) { /* ... */ }

Useful pragmas

#pragma HLS pipeline II=1      // pipeline loop with II=1
#pragma HLS unroll factor=4    // partially unroll loop

// C++ attribute equivalents
[[tapa::pipeline(1)]]
[[tapa::unroll(4)]]
[[tapa::target("ignore")]]     // mark task for custom RTL replacement

End-of-transaction macros

TAPA_WHILE_NOT_EOT(in)          { out.write(in.read(nullptr)); }
TAPA_WHILE_NEITHER_EOT(in1,in2) { /* both have data */ }
TAPA_WHILE_NONE_EOT(a, b, c)    { /* all three have data */ }

Build and run

# Software simulation
tapa g++ -- kernel.cpp host.cpp -o app
./app

# RTL synthesis
tapa compile --top Top --part-num xcu250-figd2104-2L-e \
  --clock-period 3.33 -f kernel.cpp -o kernel.xo

# Fast cosimulation
./app --bitstream=kernel.xo

# Bitstream link (v++)
v++ -o app.hw.xclbin --link --target hw --kernel Top \
  --platform xilinx_u250_gen3x16_xdma_4_1_202210_1 kernel.xo

# On-board run
./app --bitstream=app.hw.xclbin

Publications

Papers describing the TAPA compiler, the physical design toolflow it integrates, and accelerators built with TAPA.

Core Publications

TAPA Compiler

Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. IEEE FCCM, 2021. [PDF] [Code]

Introduces the TAPA task API, coroutine-based software simulation (3.2× faster than Vitis HLS sequential simulation), and fast hierarchical RTL generation (6.8× faster QoR iteration). Reduces kernel and host code by 22% and 51% on average versus Vitis HLS dataflow.

Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong. TAPA: A Scalable Task-Parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design. ACM TRETS, 2023. [PDF] [Code]

Full journal treatment of the TAPA compiler and runtime. Average frequency improves from 147 MHz to 297 MHz (102%) across 43 designs; 16 previously unroutable designs achieve 274 MHz on average after co-optimization with physical design.

Floorplanning and Physical Design

Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, Jason Cong. AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. ACM/SIGDA FPGA, 2021. (Best Paper Award) [PDF] [Code]

Doubles achievable clock frequency on average by automatically floorplanning HLS dataflow designs across SLR boundaries and inserting pipeline registers. Now maintained exclusively as a plug-in of the TAPA workflow.

Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Jie Wang, Yuze Chi, Weikang Qiao, Alireza Kaviani, Zhiru Zhang, Jason Cong. RapidStream: Parallel Physical Implementation of FPGA HLS Designs. ACM/SIGDA FPGA, 2022. (Best Paper Award) [PDF]

Split compilation with parallel placement and routing per partition. Achieves 5–7× compile time reduction and up to 1.3× frequency increase on Xilinx U250.

Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Eddie Hung, Wuxi Li, Jason Lau, Weikang Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, Alireza Kaviani, Zhiru Zhang, Jason Cong. RapidStream 2.0: Automated Parallel Implementation of Latency-Insensitive FPGA Designs through Partial Reconfiguration. ACM TRETS, 2023. [Link]

Extends RapidStream with virtual pins and partial reconfiguration. Achieves 5–7× compile time reduction and 1.3× frequency increase on Xilinx U280, approximately 2× faster than RapidStream 1.0.

Jason Lau, Yuanlong Xiao, Yutong Xie, Yuze Chi, Linghao Song, Shaojie Xiang, Michael Lo, Zhiru Zhang, Jason Cong, Licheng Guo. RapidStream IR: Infrastructure for FPGA High-Level Physical Synthesis. IEEE/ACM ICCAD, 2024. [PDF]

Generalizes RapidStream into a reusable IR for FPGA high-level physical synthesis. Supports multiple task-parallel HLS frontends including TAPA and PASTA.

Compiler Extensions

Young-kyu Choi, Yuze Chi, Jason Lau, Jason Cong. TARO: Automatic Optimization for Free-Running Kernels in FPGA High-Level Synthesis. IEEE TCAD, 2022. [Link]

Eliminates unnecessary control logic for streaming applications. Achieves 16% LUT and 45% FF reduction on systolic-array designs on Alveo U250. Integrated into the TAPA compilation flow.

Neha Prakriya, Yuze Chi, Suhail Basalama, Linghao Song, Jason Cong. TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs. ACM ASPLOS, 2024. [arXiv] [Code]

Extends TAPA to automatically partition designs across a cluster of FPGAs with the --multi-fpga N compiler flag. Handles congestion control, resource balancing, and inter-FPGA pipelining.

Moazin Khatti, Xingyu Tian, Yuze Chi, Licheng Guo, Jason Cong, Zhenman Fang. PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. IEEE FCCM, 2023; extended in ACM TRETS, 2024. [Link]

Adds automated latency-insensitive buffer (ping-pong) channel synthesis alongside FIFO streams in the task-parallel HLS flow, targeting the same class of multi-die FPGA designs as TAPA.

Suhail Basalama, Jason Cong. Stream-HLS: Towards Automatic Dataflow Acceleration. ACM/SIGDA FPGA, 2025. [Paper] [Code]

MLIR-based compiler that takes PyTorch or C/C++ and automatically generates optimized TAPA dataflow accelerators. Outperforms prior automation frameworks by up to 79× and manually-optimized TAPA designs by up to 11× geometric mean.

Akhil Raj Baranwal, Zhenman Fang. PoCo: Extending Task-Parallel HLS Programming with Shared Multi-Producer Multi-Consumer Buffer Support. ACM TRETS, 2025. [PDF]

Generalizes TAPA and PASTA's point-to-point SPSC channels to shared multi-producer–multi-consumer buffer abstractions with placement-aware optimizations for multi-die FPGAs.

Application Papers

Accelerators built with the TAPA compiler and toolflow.

Sparse Linear Algebra

Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, Jason Cong. Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication. ACM/SIGDA FPGA, 2022. [PDF] [Code]

SpMM accelerator on Alveo U280/U250. TAPA/AutoBridge-compiled DDR variant achieves 260 MHz versus a Vivado baseline of 189 MHz. Up to 2.50× geomean speedup over NVIDIA K80.

Linghao Song, Yuze Chi, Licheng Guo, Jason Cong. Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication. ACM/IEEE DAC, 2022. [Code]

SpMV accelerator on Alveo U280 using 24 HBM channels. The Vitis HLS baseline failed to route; TAPA + AutoBridge achieves 270 MHz and up to 60.55 GFLOP/s.

Linghao Song, Licheng Guo, Suhail Basalama, Yuze Chi, Robert F. Lucas, Jason Cong. Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver. ACM/SIGDA FPGA, 2023. [Code]

Conjugate gradient solver on U280 HBM. 3.94× speedup and 2.94× better energy efficiency over Xilinx XcgSolver; 3.34× better energy efficiency and 77% throughput of an A100 GPU at 4× lower memory bandwidth. Built with TAPA and AutoBridge.

Zifan He, Linghao Song, Robert F. Lucas, Jason Cong. LevelST: Stream-based Accelerator for Sparse Triangular Solver. ACM/SIGDA FPGA, 2024. [Paper] [Code]

First HBM-FPGA accelerator for SpTRSV. 2.65× speedup and 9.82× higher energy efficiency versus V100/RTX 3060 with cuSPARSE. Built on TAPA with AutoBridge floorplanning.

Manoj B. Rajashekar, Xingyu Tian, Zhenman Fang. HiSpMV / MAD-HiSpMV: Hybrid Row Distribution and Vector Buffering for Imbalanced SpMV Acceleration on FPGAs. ACM/SIGDA FPGA, 2024; extended in ACM TRETS, 2025. [Paper] [Code]

SpMV accelerator on Alveo U280 adapting row distribution to matrix structure. Uses TAPA for hardware build, cosimulation, and hardware emulation.

Ahmad Sedigh Baroughi, Xingyu Tian, Moazin Khatti, Akhil Raj Baranwal, Yuze Chi, Licheng Guo, Jason Cong, Zhenman Fang. HiSpMM: High Performance High Bandwidth Sparse-Dense Matrix Multiplication on HBM-equipped FPGAs. ACM TRETS, 2025. [Paper] [Code]

SpMM accelerator on Alveo U280 using TAPA for hardware generation, cosimulation, and runtime.

Graph Analytics

Yuze Chi, Licheng Guo, Jason Cong. Accelerating SSSP for Power-Law Graphs (SPLAG). ACM/SIGDA FPGA, 2022. [Paper] [Code]

FPGA SSSP accelerator on Alveo U280. Up to 4.9× speedup over prior FPGA accelerators and 2.6× over a 32-thread CPU; reaches 0.9× the performance of an A100 GPU, which has a 4.1× larger power budget. Fully parameterized TAPA HLS C++ implementation.

Systolic Arrays and Machine Learning

Jie Wang, Licheng Guo, Jason Cong. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. ACM/SIGDA FPGA, 2021. [Paper] [Code]

Polyhedral systolic array compiler targeting MM, CNN, LU, MTTKRP. Integrated with TAPA and AutoBridge for routing congestion resolution and frequency improvement.

Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, Jason Cong. FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA. ACM TRETS, 2023. [Paper] [Code]

CNN compilation framework for OpenPose, U-Net, E-Net, and VGG-16 on Alveo U250/U280. TAPA code generation added as a journal contribution. 2.3× performance improvement; 5× further speedup via software-hardware pipelining.

K-Nearest Neighbors

Alec Lu, Zhenman Fang, Nazanin Farahpour, Lesley Shannon. CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs. IEEE ICFPT, 2020. [Code]

KNN accelerator on Alveo U280. TAPA-compiled design achieves 252 MHz versus a Vivado baseline of 165 MHz.

Kenneth Liu, Alec Lu, Kartik Samtani, Zhenman Fang, Licheng Guo. CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs. ACM TRETS, 2023. [Paper] [Code]

Streaming-based redesign on Alveo U280 with automated TAPA HLS C code generation. Up to 45× speedup over a 48-thread CPU.

Multi-FPGA Applications

Tianqi Zhang, Neha Prakriya, Sumukh Pinge, Jason Cong, Tajana Rosing. SpectraFlux: Harnessing the Flow of Multi-FPGA in Mass Spectrometry Clustering. ACM/IEEE DAC, 2024. [Paper]

Uses TAPA-CS to partition a mass spectrometry clustering workload across multiple networked HBM-FPGAs.

Glossary

analyze

The tapa analyze step. Parses the C++ source with tapacc (a Clang-based tool) and extracts the task graph and inter-task channels to graph.json in the work directory. This step does not invoke any vendor tools and runs on any host.

async_mmap

A decoupled memory access type (tapa::async_mmap<T>). Instead of stalling on each memory operation, the kernel issues requests through address FIFOs and collects results through data and response FIFOs independently. This decoupling allows the kernel to keep the memory bus busy even when computation is not complete, enabling higher effective memory bandwidth. async_mmap must be passed by reference in task signatures.
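
To see why decoupling helps, the following sketch compares cycle counts in an idealized model. It assumes a fixed memory latency, unbounded FIFOs, and one request and one response per cycle; the function names are hypothetical and nothing here is the `tapa::async_mmap` API.

```cpp
#include <cassert>
#include <deque>

// Cycles to read `n` words when each access takes `latency` cycles and the
// kernel waits for every word before requesting the next one.
int BlockingReadCycles(int n, int latency) {
  assert(n >= 0 && latency >= 1);
  return n * latency;
}

// Cycles to read `n` words when addresses are issued one per cycle through
// an address FIFO and data is collected separately as it arrives.
int DecoupledReadCycles(int n, int latency) {
  assert(n >= 0 && latency >= 1);
  std::deque<int> in_flight;  // cycle at which each response becomes ready
  int issued = 0, received = 0, cycle = 0;
  while (received < n) {
    if (issued < n) {  // issue side: never waits for earlier data
      in_flight.push_back(cycle + latency);
      ++issued;
    }
    ++cycle;
    if (!in_flight.empty() && in_flight.front() <= cycle) {  // response side
      in_flight.pop_front();
      ++received;
    }
  }
  return cycle;
}
```

In this model, four reads at a latency of three cycles take 12 cycles when blocking and 6 cycles when decoupled, because the latency is paid once rather than per access.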

backpressure

The condition where a producer cannot write to a stream because the downstream consumer has not yet drained elements from the FIFO and the buffer is full. The producer blocks until the consumer reads at least one element. Backpressure propagates naturally through TAPA streams and is the primary flow-control mechanism.
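
A bounded FIFO makes the condition concrete. The class below is a hypothetical software model in standard C++, not the `tapa::stream` implementation; a `false` return from `TryWrite` marks the point where a blocking write would stall.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical bounded FIFO used only to illustrate backpressure.
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t depth) : depth_(depth) {
    assert(depth > 0);
  }

  // Returns false when the FIFO is full: the producer is back-pressured
  // and has to retry after the consumer has read something.
  bool TryWrite(const T& value) {
    if (queue_.size() >= depth_) return false;
    queue_.push_back(value);
    return true;
  }

  // Returns the oldest element, or nullopt when the FIFO is empty.
  std::optional<T> TryRead() {
    if (queue_.empty()) return std::nullopt;
    T value = queue_.front();
    queue_.pop_front();
    return value;
  }

 private:
  std::size_t depth_;
  std::deque<T> queue_;
};
```

With a depth of 2, a third consecutive write is rejected until one element has been read.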

cosim (see also: fast cosim)

Hardware cosimulation. Runs RTL simulation using the XO artifact to verify the hardware implementation against the software model. TAPA supports fast cosim, which uses the XO directly without running a full Vivado implementation.

detached task

A task invoked with .invoke<tapa::detach>(). A detached task runs concurrently with its siblings but the parent does not wait for it to finish before returning. Useful for background tasks such as monitors or credit managers. See tapa::task in the API reference.
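
The difference between joined and detached children can be sketched with a deterministic round-robin scheduler. This is an assumption-laden model: `Mode`, `Child`, and `RunUntilJoined` are hypothetical names, and TAPA's actual scheduling is not reproduced.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class Mode { kJoin, kDetach };

// Hypothetical child task: `step` does one unit of work and returns true
// once the task has finished.
struct Child {
  Mode mode;
  std::function<bool()> step;
  bool done = false;
};

// Step every unfinished child in turn. The parent returns as soon as all
// joined children are done; detached children may still be unfinished.
int RunUntilJoined(std::vector<Child>& children) {
  auto joined_pending = [&children] {
    for (const Child& c : children) {
      if (c.mode == Mode::kJoin && !c.done) return true;
    }
    return false;
  };
  int rounds = 0;
  while (joined_pending()) {
    for (Child& c : children) {
      assert(c.step != nullptr);
      if (!c.done) c.done = c.step();
    }
    ++rounds;
  }
  return rounds;
}
```

A detached child whose `step` never returns true (a monitor loop, for instance) does not keep the parent from returning.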

EoT (end-of-transaction)

A sentinel value written to a stream to signal the end of a data sequence. The producer calls ostream::close() to write the EoT marker; the consumer calls istream::open() to consume it. The TAPA_WHILE_NOT_EOT macro automates looping until EoT is detected.
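
The protocol can be modelled in standard C++ with a queue of tokens that carry an EoT flag. `Token`, `WriteTransaction`, and `ReadTransaction` are hypothetical names; comments mark which step plays the role of which TAPA call.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical token: either a data element or an EoT marker.
struct Token {
  int value;
  bool eot;
};

using TokenStream = std::deque<Token>;

// Producer side: write all elements, then close the transaction.
void WriteTransaction(TokenStream& stream, const std::vector<int>& data) {
  for (int v : data) stream.push_back(Token{v, false});
  stream.push_back(Token{0, true});  // plays the role of ostream::close()
}

// Consumer side: read until the EoT marker, then consume the marker so the
// next transaction starts cleanly.
std::vector<int> ReadTransaction(TokenStream& stream) {
  std::vector<int> received;
  while (!stream.empty() && !stream.front().eot) {  // like TAPA_WHILE_NOT_EOT
    received.push_back(stream.front().value);
    stream.pop_front();
  }
  assert(!stream.empty() && stream.front().eot);  // transaction must be closed
  stream.pop_front();  // plays the role of istream::open()
  return received;
}
```

Because the marker is consumed at the end of each read, two transactions written back to back are read back as two separate sequences.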

fast cosim

Synonym for cosim in the TAPA context. Fast cosim is invoked by passing a .xo file as the --bitstream argument to the host executable. The host executable runs the Rust libfrt cosim runtime in-process, which avoids a full Vivado implementation run and is significantly faster than traditional cosim flows.

leaf task

A task that contains only computation and does not call .invoke(). Leaf tasks are the units of synthesis: each leaf task is compiled to RTL by Vitis HLS independently. A leaf task may use streams, mmap, or async_mmap parameters.
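
As an illustration, leaf tasks are the nodes without children in the task hierarchy. The sketch below uses a hypothetical in-memory `Task` structure; it does not reproduce the schema of graph.json.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical task-hierarchy node.
struct Task {
  std::string name;
  std::vector<Task> children;  // tasks started via .invoke(); empty for leaves
};

// Append the names of all leaf tasks below `task`, in depth-first order.
// A leaf that is instantiated several times is listed once per instance.
void CollectLeaves(const Task& task, std::vector<std::string>& leaves) {
  assert(!task.name.empty());
  if (task.children.empty()) {
    leaves.push_back(task.name);
    return;
  }
  for (const Task& child : task.children) CollectLeaves(child, leaves);
}
```

For a parent with three children that invoke nothing themselves, the function returns those three names and omits the parent.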

mmap

Memory-mapped region. A contiguous block of host memory exposed to the kernel as a pointer-like handle (tapa::mmap<T>). The kernel accesses it synchronously, similar to a C pointer. For pipelined non-blocking access, use async_mmap instead. mmap is passed by value in task signatures.

mmaps

An array of N mmap regions (tapa::mmaps<T, N>) passed as a single argument. The framework distributes one region per child task invocation when the parent iterates over N instances.

pack

The tapa pack step. Packages per-task RTL produced by tapa synth into a single XO (or ZIP) artifact suitable for passing to v++ or for use in fast cosim.

remote execution

Offloading vendor-tool steps (HLS, pack) to a remote Linux host over SSH. Configured with --remote-host. The local machine runs tapacc (the analyze step) and transfers source files; the remote host runs Vitis HLS. Useful when cross-compiling from macOS or when the local machine lacks a Vitis licence.

stream

A FIFO channel between tasks (tapa::stream<T, Depth>). Streams are the fundamental communication primitive in TAPA. A stream is declared in an upper-level task and passed to child tasks as istream<T>& (read end) or ostream<T>& (write end). The FIFO enforces backpressure automatically.

stream depth

The number of elements the FIFO can hold before the producer blocks. Declared as the second template parameter of tapa::stream<T, Depth>. The default depth is 2. Increasing depth decouples producer and consumer and can improve throughput at the cost of FPGA BRAM or LUT resources.
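
The effect of depth can be estimated with a toy cycle model. The assumptions are simplistic: the producer attempts one write per cycle, the consumer drains one element per cycle but only from a given start cycle onward, and a read frees its slot for a write in the same cycle. `CountStallCycles` is a hypothetical helper, not a TAPA tool.

```cpp
#include <cassert>

// Count the cycles a producer spends blocked while sending `count` elements
// through a FIFO with `depth` entries.
int CountStallCycles(int depth, int count, int consumer_start) {
  assert(depth > 0 && count >= 0 && consumer_start >= 0);
  int occupancy = 0, sent = 0, stalls = 0;
  for (int cycle = 0; sent < count; ++cycle) {
    // Consumer: reads one element per cycle once it has started.
    if (cycle >= consumer_start && occupancy > 0) --occupancy;
    // Producer: writes if there is room, otherwise stalls for this cycle.
    if (occupancy < depth) {
      ++occupancy;
      ++sent;
    } else {
      ++stalls;
    }
  }
  return stalls;
}
```

With a consumer that starts four cycles late, a depth of 2 costs the producer two stall cycles while a depth of 4 absorbs the delay completely.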

synth

The tapa synth step. Runs Vitis HLS on each leaf task extracted during tapa analyze to produce per-task Verilog RTL. Results are stored in tar/ and hdl/ under the work directory.

TAPA_CONCURRENCY

Environment variable controlling the number of coroutine threads used during software simulation. Set to 1 to force sequential execution (useful for debugging). The default is the number of physical CPU cores on the host machine.
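
Reading such a setting typically looks like the sketch below. How TAPA itself treats missing or invalid values is not documented here, so the fallback behaviour and both function names are assumptions.

```cpp
#include <cassert>
#include <cstdlib>

// Parse a worker-count setting. `text` is the raw environment value (null
// when the variable is unset); `fallback` is returned when the value is
// missing or is not a positive integer.
unsigned ParseConcurrency(const char* text, unsigned fallback) {
  assert(fallback > 0);
  if (text == nullptr || *text == '\0') return fallback;
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (*end != '\0' || value <= 0) return fallback;
  return static_cast<unsigned>(value);
}

// Resolve the setting from the process environment; `default_workers`
// would be the core count in the scheme described above.
unsigned ResolveConcurrency(unsigned default_workers) {
  return ParseConcurrency(std::getenv("TAPA_CONCURRENCY"), default_workers);
}
```

Running the simulation as `TAPA_CONCURRENCY=1 ./app` is the documented way to force sequential execution.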

top-level task (upper-level task)

A task that only invokes other tasks via tapa::task().invoke() and contains no direct computation. A top-level task maps to a system-level wrapper in RTL that wires sub-task ports together. The top-level task is specified with --top on the tapa command line.

work directory

The directory where TAPA stores all intermediate artifacts between pipeline steps. Set with --work-dir. The default is work.out/ in the current directory. See Output Files for the full directory structure.

xclbin

Xilinx compiled binary. The final bitstream file produced by Vivado implementation. An xclbin is loaded onto the FPGA by the host application at runtime (via XRT or FRT). It is produced by running v++ --link on an XO file.

XO

Xilinx object file. The intermediate artifact produced by tapa pack, containing all per-task RTL and metadata in a ZIP archive. The XO is the input to v++ --link for bitstream generation, and is also passed as --bitstream to the host executable for fast hardware cosimulation.

Building from Source

Note

This guide is for developers contributing to or extending TAPA, and for advanced users building TAPA from source for custom OS support. For FPGA accelerator development with TAPA, refer to the User Documentation; most users should install a prebuilt release as described under Installation instead.

Tip

If your OS isn't officially supported, consider using a virtual machine or filing a feature request on GitHub.

System Prerequisites

To build TAPA from source, you need:

Bazel 7.3.2 or later
Binutils 2.30 or later
Git
Libstdc++ matching the most recent GCC version installed on your system
Python 3.13 or later (Bazel fetches its own managed toolchain; this version applies to the Bazel-managed Python, not necessarily the host system Python)
Other TAPA dependencies

Install these tools using your OS package manager. For Ubuntu:

# Install bazel
sudo apt-get install apt-transport-https ca-certificates curl gnupg
curl -fsSL https://bazel.build/bazel-release.pub.gpg \
  | gpg --dearmor | sudo tee /usr/share/keyrings/bazel-archive-keyring.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] \
    https://storage.googleapis.com/bazel-apt stable jdk1.8" \
  | sudo tee /etc/apt/sources.list.d/bazel.list
# Refresh the package index so apt can see the newly added Bazel repository
sudo apt-get update
sudo apt-get install bazel

# Install other tools
sudo apt-get install binutils git python3

Tip

For Bazel installation on other operating systems, see the Bazel documentation.

Note

The Dockerfile in the TAPA repository provides a complete build environment. Use it for containerized builds, or run the Ubuntu commands above to install the required tools on the host.

Clone the Repository

To get started with building TAPA from source, you'll need to clone the repository from GitHub:

git clone https://github.com/tuna/tapa.git

If you are contributing to TAPA, fork the repository and clone your fork instead. When you're ready to contribute, create a new branch for your changes, commit your work, and open a pull request against the main repository.

Modify the Build Configuration

You may need to modify the VARS.bzl file in the repository's root directory to specify the correct Vivado installation paths and versions. The build script currently assumes default installation paths at /opt/tools/xilinx/Vivado/2024.2 and /opt/tools/xilinx/Vivado/2022.2 for Vivado, and /opt/tools/xilinx/Vitis/2024.2 for Vitis.

If your Xilinx tools are installed in non-standard locations, please modify the XILINX_TOOL_PATH variable to reference the correct base installation directory for your Vivado and Vitis installations. You should also update XILINX_TOOL_VERSION to specify the version of the latest Xilinx tools you have installed. With these settings properly configured, the system will expect your Vivado installation to be located at {XILINX_TOOL_PATH}/Vivado/{XILINX_TOOL_VERSION}.

Furthermore, you should configure XILINX_TOOL_LEGACY_VERSION to indicate the earliest version of Xilinx tools installed on your system, along with XILINX_TOOL_LEGACY_PATH to point to the corresponding installation directory.
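
The directory layout implied by these variables can be spelled out with a small helper. VARS.bzl itself is a Bazel (Starlark) file; the C++ below only illustrates how the expected path is composed, and `ToolInstallDir` is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Compose the expected install directory of a Xilinx tool, following the
// {XILINX_TOOL_PATH}/<tool>/{XILINX_TOOL_VERSION} layout described above.
std::string ToolInstallDir(const std::string& base, const std::string& tool,
                           const std::string& version) {
  assert(!base.empty() && !tool.empty() && !version.empty());
  return base + "/" + tool + "/" + version;
}
```

With the default base path `/opt/tools/xilinx`, the tool `Vivado`, and the version `2024.2`, the result is `/opt/tools/xilinx/Vivado/2024.2`; the legacy variables resolve to the 2022.2 directory in the same way.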

If your system does not have the Xilinx Runtime (XRT) installed, set the HAS_XRT variable in the VARS.bzl file to False. This prevents the tests from failing due to the absence of XRT.

Build TAPA from Source

To build TAPA, navigate to the root directory of the cloned repository and execute the following command:

bazel build //...

This command compiles all TAPA targets, including the compiler, runtime library, and tests.

For building a specific target, replace //... with the desired target name. For instance, to build only the TAPA compiler:

bazel build //tapa

Note

To view all available targets, run bazel query //....

To skip building the tests, use:

bazel build //... -- -//tests/...

After the build process completes, you can find the compiled binaries in the bazel-bin directory. For example, the TAPA compiler binary is located at bazel-bin/tapa/tapa.

Note

The build process duration may vary depending on your system's performance. LLVM, a significant dependency used by TAPA for code generation, requires considerable time to build. Bazel will cache it after the initial build.

Use the Built TAPA

Important

Remember to source the Vivado settings script before running the TAPA compiler.

Once TAPA is built, you can use the compiled TAPA compiler to compile your designs. For example:

bazel-bin/tapa/tapa compile \
 -f tests/apps/bandwidth/bandwidth.cpp \
 --cflags -Itests/apps/bandwidth/ \
 -t Bandwidth \
 --clock-period 3 \
 --part-num xcu250-figd2104-2L-e

Remember to rerun the bazel build command whenever you make changes to the TAPA compiler or runtime library to ensure you're using the latest version.

Build the Documentation

The documentation is written in Markdown and built with mdBook. The Bazel build rules fetch the correct mdBook and mdbook-admonish binaries automatically — no separate install is needed.

Build a static HTML site:

bazel build //docs:build

The output is a tarball at bazel-bin/docs/book.tar.gz. Extract it to browse the HTML locally.

Serve with live reload during editing:

bazel run //docs:serve

This starts a local server (default: http://localhost:3000) that reloads automatically when source files change. Supported on Linux x86_64, macOS x86_64, and macOS arm64.

Note

The documentation source lives under docs/src/. The Bazel targets handle mdbook-admonish preprocessing automatically; do not run mdbook-admonish install manually in the source tree.

Run TAPA Tests

To run all TAPA tests, including unit tests and integration tests, use the following command in the repository's root directory:

bazel test //...

For running a specific test, replace //... with the test name. For example, to test only a specific app:

bazel test //tests/apps/vadd:vadd-xosim

Build Binary Distribution

To create a binary distribution of TAPA, navigate to the root directory of the cloned repository and execute the following command:

bazel build --config=release //:tapa-pkg-tar

Find the generated binary distribution in the bazel-bin directory, as a tarball named tapa-pkg-tar.tar.

Install the Binary Distribution

To install the binary distribution, extract the tarball to a directory of your choice:

tar -xvf bazel-bin/tapa-pkg-tar.tar -C /path/to/install

Access the TAPA compiler binary at /path/to/install/usr/bin/tapa.

Containerized Build (Advanced)

For those who prefer a containerized build environment, TAPA offers a GitHub Actions workflow that can be run locally using act. This approach ensures a consistent build environment across different systems.

Prerequisites

Install act by following the instructions in the act repository.
Ensure Docker is installed on your system, as act requires it to run the workflow.

Configuration

Before running act, set up the following configuration files:

Create a .secrets file in the repository root with the following content:
```
UBUNTU_PRO_TOKEN=[YOUR_UBUNTU_PRO_TOKEN]
MAC_ADDRESS=de:ed:be:ef:ca:fe
```
Replace [YOUR_UBUNTU_PRO_TOKEN] with your Ubuntu Pro token (available free for personal use) and de:ed:be:ef:ca:fe with your Vivado license MAC address.
Update the .actrc file in the repository root:
```
--secret-file .secrets
```
If your Vivado license and installation locations differ from the defaults (/share/software/licenses/xilinx-ci.lic and /share/software/tools respectively), update .github/actions/run-docker/action.yml accordingly.

Running Containerized Tests

To test TAPA in the containerized environment:

act -j test

This method often provides more consistent results than local testing due to the isolated environment. It also benefits from a shared Bazel cache between runs, potentially speeding up the build process.

Note

Build artifacts are not saved to the local bazel-bin directory in containerized builds. For debugging, you may need to build TAPA in your local environment. However, you can still add test cases and use act for testing your changes.

Creating a Binary Distribution

To create a binary distribution of TAPA:

act -j build

The resulting binary distribution is saved in the artifacts.out directory in the repository root (e.g., artifacts.out/1/tapa/tapa.tar.gz for the first build).

Installing the Binary Distribution

To install the binary distribution:

Extract the tarball to your preferred directory, or
Use the provided install.sh script to install TAPA to the default location:
```
TAPA_LOCAL_PACKAGE=./artifacts.out/1/tapa/tapa.tar.gz ./install.sh
```

Developing TAPA

Note

This section is intended for developers who want to contribute to TAPA. It explains the development process, the code structure, and the guidelines for contributing to the TAPA framework.

Development Environment

TAPA enforces a consistent coding style and provides tools to ensure code quality. Follow these steps to set up your development environment.

Install Pre-Commit Hooks

pip install pre-commit
pre-commit install

Note

The latest version of pre-commit is required, which depends on a newer Python version. Some hooks may fail if your Python version is outdated.

Pre-commit hooks run automatically before each commit to ensure code compliance with style guidelines. To manually run the checks:

pre-commit run --all-files

Install Python Dependencies for IDEs

While Bazel automatically installs required Python dependencies during build and test, you can manually install them for IDE access:

pip install -r tapa/requirements_lock.txt

Setting C++ Compiler Options for IDEs

Generate a compile_commands.json file to configure your IDE with Bazel's compiler options:

bazel run //:refresh_compile_commands

Code Structure

The TAPA codebase is organized into several key directories:

bazel/: Contains Bazel build configurations.

It defines how the TAPA compiler is used in the Bazel build system, and provides additional utilities for building and testing TAPA.
docs/: Includes documentation files.

The documentation is written in Markdown and built using mdBook.
fpga-runtime/: Provides the FPGA runtime library.

The FPGA runtime library interacts with a simulator or an FPGA, depending on the bitstream provided. It uses a fast, lightweight simulator for cosimulation with an XO object file, and the XRT library for Vitis simulation or on-board testing with an XCLBIN file.
tapa-cpp/: Customizes the Clang C++ preprocessor for TAPA.

The TAPA C++ preprocessor preprocesses TAPA C++ code before passing it to the tapacc compiler. It supports TAPA-specific features, such as [[tapa::pipeline]] annotations (which map to the Vitis HLS PIPELINE pragma) and [[tapa::unroll]] annotations (which map to the Vitis HLS UNROLL pragma).
tapa-lib/: Houses the TAPA runtime library.

The TAPA runtime library provides core functionality for TAPA tasks, streams, and memory maps. It implements platform-specific features (e.g., software simulation queues, hardware FIFOs).
tapa-llvm-project/: Contains the LLVM project with TAPA-specific patches (fetched as an external Bazel dependency, not checked in to the repository).

TAPA uses LLVM Clang to generate system interconnect and transformed C++ code for each task. The LLVM project is customized with TAPA-specific features, such as C++ annotations.
tapa-system-include/: Creates a custom system include directory for TAPA.

This Bazel build target collects system include files for tapa-cpp and tapacc compilers. It includes standard C++ headers, TAPA dependencies, and TAPA-specific headers for the compilers to run on every OS.
tapa/: Contains the core TAPA compiler and runtime library.

The TAPA compiler serves as the entry point for the TAPA framework. It invokes the tapa-cpp and tapacc compilers, synthesizes tasks into RTL using HLS tools, and generates the system interconnect and an XO object file for the FPGA. For the xilinx-hls target, a .zip RTL archive is generated instead.
tapacc/: Implements the TAPA C++ compiler to translate TAPA tasks to JSON.

The TAPA C++ compiler is a Clang-based compiler for TAPA tasks. It analyzes tasks and streams, generating JSON representation of tasks and dataflow.
tests/: Includes test cases for the TAPA compiler and runtime library.

The folder includes various TAPA applications. It includes microbenchmarks under apps/ for basic functionality testing, and regression/ for performance evaluation of TAPA compiled designs.

Update Dependencies

TAPA depends on several external libraries and tools. This section explains how to update these dependencies.

General Version Bump Process

When bumping versions, follow this general workflow:

Clear existing lock files.
Update dependency declarations.
Regenerate lock files.
Test the build.
Commit changes.

Bazel Dependencies

For Bazel dependencies:

Update the version numbers in MODULE.bazel.
Check the Bazel Central Registry for latest versions, and update the bazel_dep entries in MODULE.bazel accordingly.
Remove MODULE.bazel.lock to force regeneration.

For Python and Node.js toolchains in MODULE.bazel:

# Update Python version
python.toolchain(
    python_version = "3.13.2",  # Update version here
    ...
)
use_repo(python, python_3_13 = "python_3_13_2")  # Update repo name too

# Update Python version in pip declaration
pip.parse(
    python_version = "3.13.2",  # Update version here
    ...
)

# Update Node.js version
node.toolchain(node_version = "17.9.1")

Python Dependencies

To update Python packages:

# Clear existing lock file
echo > tapa/requirements_lock.txt

# Update the dependencies
bazel run //tapa:requirements.update

This will regenerate the requirements_lock.txt file with the latest compatible versions.

XRT Dependency

For XRT (Xilinx Runtime):

Check the XRT GitHub releases for latest versions.

Update the version and SHA256 checksum in MODULE.bazel:

XRT_VERSION = "202420.2.18.179"  # Update version
XRT_SHA256 = "..."  # Update SHA256 checksum

Calculate SHA256 checksum with:

curl -L https://github.com/Xilinx/XRT/archive/refs/tags/{VERSION}.tar.gz | sha256sum

LLVM Version Updates

To update the LLVM version:

Find the latest stable release of LLVM on LLVM GitHub releases.

Update the version numbers in MODULE.bazel:

LLVM_VERSION_MAJOR = 20
LLVM_VERSION_MINOR = 1
LLVM_VERSION_PATCH = 4

Update the SHA256 checksum after downloading the new version:
```
LLVM_SHA256 = "<new_sha256_checksum>"
```

Docker Images

For the Docker testing and building environments:

Update the base image versions in .github/docker/*.

Update the system dependencies trigger date to the current date, so that the Docker image is rebuilt with the latest system dependencies:

RUN apt-get update && \
    # Update the following line to the latest date for retriggering the docker build
    echo "Installing system dependencies as of 20250505" && \
    apt-get upgrade -y

Pre-commit Hooks

Update pre-commit hooks to the latest versions:

pre-commit autoupdate

Verifying Updates

After updating dependencies:

Remove the lock file: rm MODULE.bazel.lock
Run a full build: bazel build //...
Run the pre-commit checks: pre-commit run --all-files
Commit the changes: git commit -a -m "build(deps): bump versions"

Note

This section provides guidance on updating all types of dependencies in the TAPA project, including where to find the latest versions and how to verify that the updates work correctly.

Contributing to TAPA

Pull Request Process

Fork the TAPA repository and create a new branch for your feature or bug fix.
Ensure all tests pass and pre-commit hooks run successfully.
Write a clear and concise description of your changes in the pull request.
Request a review from the TAPA maintainers.

Continuous Integration

TAPA uses GitHub Actions for continuous integration. The CI pipeline:

Builds binary distributions on Ubuntu 18.04 self-hosted runners.
Performs code quality checks using pre-commit hooks on every commit.
Runs functional and integration tests via staging workflows across a matrix of platforms and Vitis versions for every main branch push.

Documentation

Update the documentation in the docs/ directory for any new features or changes.
Use Markdown format for documentation files.
Run the following command in the docs/ directory to build and preview documentation changes locally:
```
bash build.sh
```

Testing

Add appropriate unit tests for new features or bug fixes.
Ensure all existing tests pass before submitting your changes.
Run the full test suite using the following command:
```
bazel test //...
```

Reporting Issues

Use the GitHub issue tracker to report bugs or suggest new features.
Provide a clear and concise description of the issue or feature request.
Include steps to reproduce the issue, if applicable.
Attach relevant log files or screenshots, if available.

Community Guidelines

Be respectful and considerate in all interactions with other contributors.
Provide constructive feedback on pull requests and issues.

Releasing TAPA Builds

Note

This section explains how to release TAPA builds. It is intended for maintainers with write access to the TAPA repository.

Automated Release Process

Releases are automated via GitHub Actions. The publish-release.yml workflow builds and publishes a release to GitHub Releases.

To create a release:

Update the VERSION file on main with the desired version string (e.g. 0.1.20260319).
Trigger the Publish Release workflow via workflow_dispatch from the GitHub Actions UI. Optionally override the version in the input field; if left blank, the contents of the VERSION file are used.

The workflow will:

Build the release tarball on a self-hosted runner
Create the git tag v<version> on main
Publish tapa.tar.gz and tapa-visualizer.tar.gz to GitHub Releases
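
The version selection described above (manual override first, VERSION file otherwise, tag prefixed with `v`) can be sketched as follows. The actual logic lives in the workflow definition; `Trim` and `ReleaseTag` are hypothetical helpers, and the whitespace trimming is an assumption.

```cpp
#include <cassert>
#include <string>

// Strip leading and trailing whitespace (a VERSION file usually ends in a
// newline).
std::string Trim(const std::string& s) {
  const char* ws = " \t\r\n";
  const auto begin = s.find_first_not_of(ws);
  if (begin == std::string::npos) return "";
  const auto end = s.find_last_not_of(ws);
  return s.substr(begin, end - begin + 1);
}

// Pick the release version: the manual override when one is given,
// otherwise the contents of the VERSION file. The tag is "v" + version.
std::string ReleaseTag(const std::string& override_input,
                       const std::string& version_file_contents) {
  const std::string chosen = Trim(override_input);
  const std::string version =
      chosen.empty() ? Trim(version_file_contents) : chosen;
  assert(!version.empty());
  return "v" + version;
}
```

With a blank override and a VERSION file containing `0.1.20260319`, the resulting tag is `v0.1.20260319`.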

Staging Builds

Every push to main triggers the staging-build.yml workflow, which runs the full test matrix across all supported OS and Vitis version combinations. Staging builds are uploaded as workflow artifacts (retained for 7 days) but are not published as releases.

Installing a Release

Users can install a published release with:

curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | sh -s -- -q

To install a specific version by tag:

curl -fsSL https://raw.githubusercontent.com/tuna/tapa/main/install.sh | TAPA_VERSION=x.y.z sh -s -- -q

To install from a local release tarball:

TAPA_LOCAL_PACKAGE=./tapa.tar.gz ./install.sh -q
