PL AI Engine Control (PLAIC)

AIE control is normally handled from the PS. However, many of these operations perform better when executed from a controller residing in the PL: instead of commands traveling PS -> PL -> AIE, they go directly PL -> AIE. VSI’s PLAIC resides in the PL and executes programmable instructions to carry out control tasks.

Using VSI’s PLAIC in a design offers the following advantages:

  • Faster execution of basic graph commands (e.g., graph run, graph wait).
  • Ability to execute kernel code that exceeds the standard 16KB program-memory limit.
  • Faster kernel program switches when implementing load-on-demand (LOD) AIE.

Note: Importing “sub-graphs” into the AIE via software import wizards is not supported with PLAIC at this time.

VSI installs with an example application, which this section uses to walk through the steps needed to implement a design that uses PLAIC.

Platform setup

  1. From VSI’s start page, under “Quick Start”, click “Open Example Project” to start the project setup wizard.

  2. At the “Select Project Template” pane, select VCK190. This platform is based on a Versal IC that has AI Engines.

  3. Proceed through the next few screens, choosing a name and location for the project, then selecting “Versal VCK190 Evaluation Platform” as the board.

  4. The “Select Design Preset” pane comes next. Scroll to the bottom to find where PLAIC can be configured in the platform. Check the “Add PLAIC” box, which enables further configuration options, then check “enable sync out”. As mentioned earlier, the PLAIC is programmable, so it can be instructed to send a “sync out” signal when needed to coordinate its activity with other components in the application. After this configuration is implemented in the platform, we will revisit it and explain the configuration options in more detail.

  5. We will now select the preset application to implement. Scroll back to the top of the pane, and with the “Choose Application” drop-down menu, select “plaic large kernel program”.
    If the user instead wanted to develop an application from scratch, the preset would be left as “blank”. However, to keep this documentation focused on the PLAIC-related aspects of the design, the preset application is used to generate all the non-PLAIC application components.

    The PLAIC application preset implements a large AI kernel that, without PL control, would require more memory than the AI core can handle on its own. The PLAIC brings in the extra data the AI core needs from external memory, as it is needed.

  6. Click “next”, then “finish”, to complete the application setup. VSI will take a few minutes to fully implement the design.

  7. Once VSI has created the design, it leaves off in the “System Canvas”. We will now look back at the platform that was created: from the menu bar, select Flow -> Open Platform -> vck190_base_platform.

  8. The configurations given during platform setup merely automated the placement of the PLAIC IP and its necessary connections to other components; a user could place it manually in their own Versal platform if desired. Click the “+” symbol on the versal_fabric hierarchy to examine the PL layout of the platform, and find the component “plaic_0”.

  9. Double-click the PLAIC IP. Note that its configuration options are the same as those in the platform setup wizard:

    • s_axi_control: Controls PLAIC startup and memory-offset configuration, and reports the PLAIC’s running status. Before the PLAIC can run, it must have a program to read, so as an initialization step, the program must be loaded into memory before the start command is issued through this port.
    • sync_in and sync_out: Ports for handling programmable stream signals that the PLAIC can use for coordination with other components in the application.

    PLAIC Mode:

    • Fixed program: The PLAIC will execute its entire program on its own from start to finish.
    • Slave: The PLAIC executes one instruction at a time, based on requests from an external master. When this mode is set, additional cmd_in and cmd_resp interfaces become available for connecting to the master.

    There are also two AXI-MM Full interfaces:

    • mem_master: The PLAIC reads its program and any necessary metadata from the memory connected at the endpoint of this interface. In this example, the application has the PLAIC load an AIE core with ELF data: the instruction to load the core is PLAIC opcode data, while the ELF is metadata.
    • aie_master: Must be connected with a path to an AXI-MM slave interface on the AIE array. The PLAIC will communicate with the AIE through this port.
  10. Click Cancel to close the dialog without changing the PLAIC configuration. Note the connections made to the PLAIC IP:

    • s_axi_control: Connects out to the system canvas, where it is attached to a driver in the PS, because the PS is responsible for making sure the PLAIC’s program is in place before starting it.
    • sync_out: Also goes to the system canvas. The PLAIC uses it to alert the AIE data movers (RDMAs) when it is safe to pass application data into the AI cores.
    • mem_master: Goes to a NOC that has access to DDR memory.
    • aie_master: Goes to a NOC that has access to an AIE S-AXI-MM port.

System Setup

  1. From the menu bar select Flow -> Open System -> versal_system.

  2. Click the “+” symbol on the versal_ps hierarchy, and find the block “plaic_driver_base”.
    As mentioned earlier, the PLAIC’s startup sequence requires its program to be loaded into memory before it is started. This driver block handles that. Note its connections:

    • mem: connects to DDR memory via an arbiter.
    • control: connects to the s_axi_control interface of the PLAIC IP via a system interface to the platform.

    When the application’s software is built, a PLAIC program is compiled and made available to this driver. When the PS executable runs, this driver loads that program into DDR via the mem port, then sends the start command to the PLAIC hardware. After that, it monitors the PLAIC’s status.

    sync ports: These are optional and unnecessary in this application. In certain use cases it can be useful to communicate with the driver through them; for the most part, however, this driver performs its initialization tasks without any further user interaction.

  3. Double-click the plaic_driver_base block. It is a vsi_gen_ip block set with a function (plaic_driver) imported from the PLAIC driver source code distributed with VSI:
    $VSI_INSTALL/common/ip_repo/plaic/software_driver/plaic_driver.cc
    Note the configurations:
    alt text

  4. The versal_ps hierarchy blocks mem2aie_driver and aie2mem_driver are only tangentially related to the PLAIC. They drive the RDMAs that move application data (as opposed to control data) through the AI kernel. The source code for these drivers is in:
    $VSI_INSTALL/target/common/hls_examples/lab/plaic/ps_src/plaic_fxp/driver.cc

    The RDMAs are programmable, and their program instructs them not to move any data until they receive a “sync_in” signal, which the PLAIC sends after it loads and runs the AI kernel. The user does control the RDMA program, but programming the RDMAs is beyond the scope of this section. To see where the sync happens in the mem2aie RDMA program, see line 4 of the file:
    $VSI_INSTALL/target/common/hls_examples/lab/plaic/ps_src/rdma_mem2aie.h:4

  5. The versal_ps hierarchy block lv_memory is not related to the PLAIC, so it is not discussed in detail. It is responsible for loading application data into DDR, where the RDMAs access it, and for validating that data.

  6. If not already opened, click the “+” symbol on the versal_fabric hierarchy. Note how the PLAIC’s sync_out signal from the platform (highlighted purple below) is sent through a broadcast to the RDMAs’ sync_in ports to facilitate the behavior described in step 4.
    alt text

  7. Click the “+” symbol on the versal_aie hierarchy.

  8. Double click on the block “vsi_context” in the versal_aie hierarchy, and select the tab “AIE Options” to find where the last of the necessary system configurations are made.

    The most important setting here is the PLAIC enablement checkbox “Use PL control”.
    Also important is the test iteration setting of “1”, which controls how many times the AI kernel runs. After the kernel is started, the PLAIC waits for it to complete; if the iteration count were set to infinite, the PLAIC would never exit the wait.
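
The startup ordering handled by plaic_driver_base (step 2 above: load the PLAIC program into DDR, then issue the start command, then monitor status) can be illustrated with a small software model. This is a hypothetical sketch only: `PlaicModel`, its fields, and `plaic_startup` are invented names, the model "finishes" in a single step, and none of it reflects the real s_axi_control register map or the contents of plaic_driver.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model of the PLAIC as seen by the PS driver. Field names are
// illustrative only; they are not the real s_axi_control register map.
struct PlaicModel {
    std::vector<std::uint32_t> ddr;  // memory reached through the PLAIC's mem_master port
    std::size_t mem_offset = 0;      // program location, set via s_axi_control
    bool started = false;
    bool done = false;

    // Refuse to start unless a program has been placed at mem_offset.
    void start() {
        if (mem_offset >= ddr.size()) throw std::runtime_error("no program loaded");
        started = true;
    }
    // Stand-in for the hardware running its program; finishes in one step.
    void step() {
        if (started) done = true;
    }
};

// Hypothetical driver sequence: 1) load program, 2) start, 3) poll status.
// Returns the number of status polls performed before the PLAIC reported done.
int plaic_startup(PlaicModel &plaic, const std::vector<std::uint32_t> &program,
                  std::size_t offset) {
    assert(!program.empty());
    if (plaic.ddr.size() < offset + program.size()) plaic.ddr.resize(offset + program.size());
    for (std::size_t i = 0; i < program.size(); ++i) plaic.ddr[offset + i] = program[i];
    plaic.mem_offset = offset;  // configure where the program lives
    plaic.start();              // only now issue the start command
    int polls = 0;
    while (!plaic.done) {       // monitor status until the program halts
        plaic.step();
        ++polls;
    }
    return polls;
}
```

The point of the model is the ordering: calling `start()` on an empty model throws, mirroring the requirement that the program be in memory first.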

Programming

The default program that will run on the PLAIC is created automatically when the “Generate System” command is run. From the menu bar, select: Flow -> Generate System.

When generate system finishes, open the following file inside the project:
<project_root>/vsi_auto_gen/sw/versal_system/versal_aie/plaic_pre_instruct

init_ai_mem versal_aie all PM DMb
graph_run versal_aie top_graph_versal_aie_inst 1
sync_write
graph_wait versal_aie top_graph_versal_aie_inst
halt

This instruction sequence is what gets compiled and run on the PLAIC, but it is not compiled until after the software is built. If desired, the user can manually edit the program after “Generate System” to achieve different behavior.
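
The default program follows a fixed pattern, so a hypothetical helper such as `make_default_program` below (not part of VSI) shows how its five lines relate to the build target, graph name, and iteration count. It is a sketch assuming the single-target, non-LOD case shown above; it could serve as a starting point when generating or editing instruction text by hand.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a VSI API): builds text in the same shape as the
// generated plaic_pre_instruct for a single, non-LOD build target.
std::string make_default_program(const std::string &target, unsigned iterations) {
    assert(!target.empty());
    const std::string graph = "top_graph_" + target + "_inst";
    std::string prog;
    prog += "init_ai_mem " + target + " all PM DMb\n";  // load program and data memory
    prog += "graph_run " + target + " " + graph + " " + std::to_string(iterations) + "\n";
    prog += "sync_write\n";                             // release the RDMAs
    prog += "graph_wait " + target + " " + graph + "\n";  // block until the graph completes
    prog += "halt\n";                                   // stop the instruction processor
    return prog;
}
```

With `make_default_program("versal_aie", 1)` the result matches the generated file line for line. Keep the iteration count finite: the blocking graph_wait only returns once the graph completes.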

Instructions

PLAIC instruction language reference.

General format:

instruction_code [instruction argument section]

Generic argument sections (used by many of the instructions):

  • [AIE build target]: Without LOD, this is the AIE context hierarchy name. With LOD, it has the format:
    <aie_context_hier_name>_<target_lod_ip_name>_<target_lod_id>
  • [Name of graph]: The graph that the instruction applies to. VSI often consolidates AIE blocks into single “top graphs”, which are named in the format:
    top_graph_<aie_build_target_name>_inst
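
The two naming formats compose: the top graph name is built from the build target name, which itself depends on whether LOD is used. The hypothetical helpers below (not VSI APIs) simply apply those formats; the LOD IP name used in the usage example is invented for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers (not VSI APIs) that apply the naming formats above.

// Build target without LOD: the AIE context hierarchy name itself.
std::string build_target(const std::string &aie_context_hier) {
    assert(!aie_context_hier.empty());
    return aie_context_hier;
}

// Build target with LOD: <aie_context_hier_name>_<target_lod_ip_name>_<target_lod_id>
std::string build_target(const std::string &aie_context_hier,
                         const std::string &lod_ip_name, unsigned lod_id) {
    assert(!aie_context_hier.empty() && !lod_ip_name.empty());
    return aie_context_hier + "_" + lod_ip_name + "_" + std::to_string(lod_id);
}

// Top graph name: top_graph_<aie_build_target_name>_inst
std::string top_graph_name(const std::string &build_target_name) {
    return "top_graph_" + build_target_name + "_inst";
}
```

For the example project, `top_graph_name(build_target("versal_aie"))` yields `top_graph_versal_aie_inst`, the graph name used in the default program.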

Instruction Codes:

init_ai_mem [AIE build target] [Kernels to load] [Mem types to load]
Loads the AIE tiles with the initialization memory needed before they can run.
Special arg sections:

  • Kernels to load: Either “all”, to load all kernels in the build target, or a list of a subset of the kernels.
  • Mem types to load: AIE memory comes in two types, program memory and data memory, with type codes “PM” and “DMb” respectively. Either or both of these type codes can be listed in this argument section.

graph_run [AIE build target] [Name of graph] [Iteration count]
Runs all the kernels in the given graph.
Special arg sections:

  • Iteration count: Number of times each kernel function in the graph will be run.

graph_wait [AIE build target] [Name of graph] [Optional non-blocking "nb" code] [Optional cycles]
Wait for all the kernels of the given graph to complete.
Special arg sections:

  • Optional non-blocking “nb” code: With the “nb” code, the PLAIC queues up the graph wait but does not immediately wait for the graph to finish. In some circumstances the PLAIC does not need to block on a graph right away; the wait can be a lower-priority task that the PLAIC performs after executing other instructions. The “nb” code allows higher-priority instructions to be issued after the wait.
  • Optional cycles: Stop waiting after the given number of cycles.

sync_write
The PLAIC outputs the 32-bit integer value “1” on the sync_out stream port.

sync_read
The PLAIC reads a 32-bit value from the sync_in stream port, blocking on the read if no value is available yet.
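
sync_write and sync_read behave like a blocking stream handshake. As a software analogy only, the sketch below models a sync stream with a `std::mutex`/`std::condition_variable` queue (`SyncStream` and `rdma_released_by_sync` are hypothetical names, not hls::stream or VSI code). It mirrors the example application: the RDMA side blocks on its sync input until the PLAIC side writes “1” after starting the graph.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>

// Software analogy for a 32-bit sync stream (hypothetical; not hls::stream).
class SyncStream {
public:
    // Like sync_write: push a 32-bit value (1 by default).
    void write(std::uint32_t value = 1) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(value);
        }
        cv_.notify_one();
    }
    // Like sync_read: block until a value is available, then consume it.
    std::uint32_t read() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::uint32_t v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::uint32_t> q_;
};

// Models the example application: the RDMA side blocks on its sync input
// until the PLAIC side signals that the AI kernel has been loaded and started.
bool rdma_released_by_sync() {
    SyncStream sync;
    bool released = false;
    std::thread rdma([&] {
        released = (sync.read() == 1);  // blocks until the PLAIC writes
        // ... the RDMA would start moving application data here ...
    });
    // ... PLAIC side: init_ai_mem, graph_run ...
    sync.write();  // sync_write
    rdma.join();
    return released;
}
```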

graph_preload [AIE build target] [Name of graph]
Load any preloadable memory sections of the given graph. These are memory sections that are safe to load while a different graph is running.

halt
Stop the PLAIC instruction processor.
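
Before hand-editing plaic_pre_instruct, it can help to check each line against the argument shapes above. The function below is a hypothetical syntax check, not VSI’s PLAIC compiler: it checks only opcodes and argument counts/forms as described in this reference (reading the graph_wait brackets as “nb” and a cycle count each being optional, in that order), and it does not verify that targets, graphs, or kernels exist in the design.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// True if s is a non-empty string of decimal digits.
static bool is_number(const std::string &s) {
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isdigit(c)) return false;
    return true;
}

// True for the two memory type codes accepted by init_ai_mem.
static bool is_mem_type(const std::string &s) { return s == "PM" || s == "DMb"; }

// Hypothetical syntax check (not VSI's PLAIC compiler).
bool is_valid_plaic_line(const std::string &line) {
    std::istringstream in(line);
    std::vector<std::string> t;
    for (std::string tok; in >> tok;) t.push_back(tok);
    if (t.empty()) return false;

    const std::string &op = t[0];
    const std::size_t nargs = t.size() - 1;

    if (op == "sync_write" || op == "sync_read" || op == "halt") return nargs == 0;
    if (op == "graph_preload") return nargs == 2;
    if (op == "graph_run") return nargs == 3 && is_number(t[3]);
    if (op == "graph_wait") {
        if (nargs == 2) return true;
        if (nargs == 3) return t[3] == "nb" || is_number(t[3]);
        if (nargs == 4) return t[3] == "nb" && is_number(t[4]);
        return false;
    }
    if (op == "init_ai_mem") {
        if (nargs < 3) return false;  // needs target, kernels, and mem types
        std::size_t first_mem = t.size();
        while (first_mem > 2 && is_mem_type(t[first_mem - 1])) --first_mem;
        const std::size_t n_mem = t.size() - first_mem;
        const std::size_t n_kernels = first_mem - 2;  // tokens between target and mem types
        return n_mem >= 1 && n_kernels >= 1;
    }
    return false;  // unknown opcode
}
```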

Slave Mode

When the PLAIC is set to slave mode, the PS can execute any of the instructions used in the plaic_pre_instruct file on demand. This can be useful, for instance, if the build target loaded on the AIE must be replaced with another target depending on some condition the PS is responding to. To set up an application for this, the following requirements must be met:

  1. Set platform PLAIC IP to slave mode.

  2. Make the cmd_in and cmd_resp ports system interfaces.

  3. In the PS driver code:

    • Include header plaic_master_exec.h
    • Add two hls::stream<ap_axis_plaic<32> > args to the driver function.
    • Use plaic_exec(stream_to_plaic_cmd_in, stream_from_plaic_cmd_resp, "instruc string") to execute an instruction line. If the PLAIC is not yet started, a plaic_exec call will block until it is.

    Example:

    #include <plaic_master_exec.h>
    ...
    void example_driver(hls::stream<ap_axis_plaic<32> > &toPl,
                        hls::stream<ap_axis_plaic<32> > &frPl)
    {
        ...
        // Note: misc user code can be placed between these calls as needed.
        plaic_exec(toPl, frPl, "init_ai_mem versal_aie all PM DMb");
        plaic_exec(toPl, frPl, "graph_run versal_aie top_graph_versal_aie_inst 1");
        plaic_exec(toPl, frPl, "sync_write");
        // Note: graph wait below is blocking.
        plaic_exec(toPl, frPl, "graph_wait versal_aie top_graph_versal_aie_inst");
        plaic_exec(toPl, frPl, "halt");
        ...
    }
    
  4. Make sure the cmd_in and cmd_resp system interfaces connect to the driver function block.
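
For the reload use case described at the start of this section, the instruction strings can be built at run time instead of hard-coded. The sketch below is hypothetical: `load_and_run` and `ExecFn` are invented names, and the `exec` callback stands in for `plaic_exec(toPl, frPl, ...)` so the code runs without the VSI headers. The LOD build target names follow the `<aie_context_hier_name>_<target_lod_ip_name>_<target_lod_id>` format with an invented LOD IP name “lod”. The sequence mirrors the default program; whether a sync_write is needed for every target depends on the RDMA programs in the design.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for plaic_exec(toPl, frPl, cmd) so the sketch runs anywhere.
// Hypothetical: a real driver would forward each string to plaic_exec.
using ExecFn = std::function<void(const std::string &)>;

// Hypothetical helper: load a build target onto the AIE and run its top graph.
// graph_wait is issued without "nb", so it blocks; the next target is only
// loaded after this returns, since a LOD switch cannot occur mid-wait.
void load_and_run(const ExecFn &exec, const std::string &target, unsigned iterations) {
    assert(!target.empty());
    const std::string graph = "top_graph_" + target + "_inst";
    exec("init_ai_mem " + target + " all PM DMb");
    exec("graph_run " + target + " " + graph + " " + std::to_string(iterations));
    exec("sync_write");
    exec("graph_wait " + target + " " + graph);
}
```

In a driver, `exec` could be a lambda that forwards each string to `plaic_exec`; check plaic_master_exec.h for the exact parameter type it expects. Issue “halt” once no further instructions are needed.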

Building and Running

  1. Building an application that has PLAIC mostly follows the standard project build sequence. One important requirement, however, is that the AIE software must be built before the PS software, and there is a special software build sequence to achieve this. In the Tcl console, execute the following commands (note: if this is the first software build for the project, the “clean” command is not necessary):

    vsi::clean_projects
    vsi::build_sw_for_plaic
    
  2. When the software build finishes, complete the project build by selecting the following from the menu bar:

    • Flow -> Build -> Build HLS
    • Flow -> Build -> Build Hardware
      Note: This example project is configured for “co-simulation”, so the hardware build will not take as long as a non-simulation build, but could still take 20 to 60 minutes, depending on the user’s workstation specs.
  3. Once the hardware build completes, it is possible to run the application in co-simulation. In the Tcl console, execute the following:

    vsi::launch_co_simulation versal_fabric -mode gui
    
  4. It might take a minute or two to bring up the entire simulation environment, but once it’s up, click the “Run All” button in the hardware control bar.

  5. Important: While the PLAIC has extensive control over AIE tiles, it cannot program AIE stream switches or DMA modules. These are configured by loading an xclbin once, as an initialization step. The software build generated a script called logrun.sh that will 1) load the xclbin, 2) load the Linux driver, and then 3) run the PS executable.

    • The PS portion of the simulation runs in a QEMU virtual machine, which is accessed through the xterm window that opened when co-simulation started. The virtual Linux takes a minute or two to boot after the “run” command is issued; once it has booted, start the application by issuing the following command in this QEMU terminal:
    /run/media/mmcblk0p1/versal_system/versal_ps/logrun.sh
    


  6. Simulation runs much slower than real hardware, so it could take several minutes to complete (more or less, depending on the user’s workstation specs). The console output can also be noisy, and it is normal to see many [dma_transfer]: could not get cdma ... prints.
    What matters most is that the terminal output ends with the following line, indicating a successful run:

    load_validate_memory SUCCESS: valid data read back from DDR!
    
  7. End the simulation by first pausing the run, then selecting File -> Close Simulation from the menu bar.

Load-on-demand

Developing a PL-controlled LOD AIE application has all the same requirements as developing a PS-controlled one, plus some additional constraints:

  • Graphs must be isomorphic between LOD applications. The PLAIC cannot reprogram the AIE stream switches or DMA modules, so when switching between LOD applications, the only things that can be reconfigured are the AI cores’ program memory, data memory, and control registers. This means the kernels in each LOD application must have the same connection structure. VSI analyzes the design for violations of this constraint and reports them as errors.
  • If a LOD app is in a “graph wait” state (cores running), the application cannot be switched to a different LOD graph until the wait completes.
  • LOD IDs in the design must start at zero and increment by one for each new LOD app.
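
The LOD ID rule amounts to requiring that the set of IDs be exactly 0, 1, ..., N-1. A hypothetical pre-check (not part of VSI, which performs its own design analysis) could sort the IDs and compare each one to its position, which catches a nonzero start, gaps, and duplicates in one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pre-check (not part of VSI): LOD IDs must be exactly 0..N-1.
// Order of declaration does not matter; the IDs are sorted first.
bool lod_ids_valid(std::vector<unsigned> ids) {
    std::sort(ids.begin(), ids.end());
    for (std::size_t i = 0; i < ids.size(); ++i)
        if (ids[i] != i) return false;  // nonzero start, gap, or duplicate
    return true;
}
```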