Dual Side Sparse Tensor Core

Implementation of the Dual-side Sparse Tensor Core on GPGPU-Sim

DSTC

References

Dual-side Sparse Tensor Core (ISCA 2021)

My Hypothesis

Since GPGPU-Sim (performance mode) models hardware in an event-driven manner, the modified dataflow (outer product) is not directly reflected in the code (maybe in the m_fu part? Still need to verify). Instead, we need to dynamically determine the runtime latency of the original mma PTX instruction, which DSTC breaks down into omma instructions. So the most heavily modified parts should be: first, the instruction parser (to support omma and bomma); second, the operand collector (the optimization shown in Fig. 19); third, the runtime sparse bitmap generation, compaction, and popcount; and last, the adaptive HMMA instruction latency. As for the exact mapping of thread groups within a warp to their load/computation workload, it is not modeled in the simulator; that part of the work should be verified by implementing RTL to estimate the PPA.
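To make the latency part of the hypothesis concrete, below is a minimal sketch of the idea: derive per-tile occupancy bitmaps, then scale the dense HMMA latency by the fraction of outer-product k-steps that survive two-sided skipping. The names (tile_bitmap, omma_dynamic_latency) and the linear scaling model are assumptions for this post, not DSTC's exact model or a GPGPU-Sim API.

#include <algorithm>
#include <cstdint>

// Build a k-step occupancy bitmap for a tile stored k-major (vec_len
// contiguous elements per k-step): bit k is set if A's column k (or B's
// row k) contains at least one nonzero value.
uint32_t tile_bitmap(const float *tile, unsigned k_dim, unsigned vec_len) {
  uint32_t bitmap = 0;
  for (unsigned k = 0; k < k_dim; ++k)
    for (unsigned i = 0; i < vec_len; ++i)
      if (tile[k * vec_len + i] != 0.0f) {
        bitmap |= (1u << k);
        break;
      }
  return bitmap;
}

// Estimate the dynamic omma latency: an outer-product step k can be skipped
// when either A's column k or B's row k is all zero, so only k-steps whose
// bit is set in both bitmaps cost cycles.
unsigned omma_dynamic_latency(uint32_t a_bitmap, uint32_t b_bitmap,
                              unsigned dense_k, unsigned dense_latency) {
  unsigned surviving_k = __builtin_popcount(a_bitmap & b_bitmap);
  unsigned lat = (dense_latency * surviving_k + dense_k - 1) / dense_k;
  return std::max(lat, 1u);  // keep at least one cycle for bitmap decode
}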

Implementation Details

operand collector (preliminaries)

behavior modeling of the operand collector and its microarchitecture

The operand-collector-based register file sits between the Instruction Dispatch (ID) unit and the Execution (EX) unit. Within an SM (or sub-partition), the different dispatch paths (datapaths) share the same register file. For each path, a set of in_ports and out_ports of the register file is assigned to opndcoll_rfu. Instructions enter the collector units (CUs) through the in_ports; once their source operands have been gathered, they leave through the out_ports when the execution unit is available to issue. The structure is akin to the reservation stations in Tomasulo's algorithm: it decouples execution from register file access.
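To make the reservation-station analogy concrete, here is a simplified, self-contained sketch of what a single collector unit tracks; the struct and its member names are illustrative only, the real class being opndcoll_rfu_t::collector_unit_t.

#include <array>

// A collector unit buffers one dispatched instruction and marks each source
// operand as the banked register file returns it; once every source is ready,
// the instruction can be pushed into the OC_EX pipeline register.
struct CollectorUnitSketch {
  static const unsigned kMaxSrc = 4;
  bool free = true;                   // can accept a newly dispatched instruction
  unsigned num_src = 0;
  std::array<bool, kMaxSrc> ready{};  // per-source-operand ready flags

  void allocate(unsigned n_src) {     // dispatch from an ID_OC pipeline register
    free = false;
    num_src = n_src;
    ready.fill(false);
  }
  void collect_operand(unsigned src) { ready[src] = true; }  // bank read returned
  bool all_operands_ready() const {   // eligible to issue to the execution unit
    for (unsigned i = 0; i < num_src; ++i)
      if (!ready[i]) return false;
    return true;
  }
};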

opndcoll_rfu_t

relevant code

Behavior modeling of the operand collector is integrated into the shader_core execution pipeline: shader_core_ctx::read_operands() invokes m_operand_collector.step().
The shader creates operand collector register file ports for the tensor core.
in_ports is initialized with m_pipeline_reg[ID_OC_*] to model the hardwired connection from the corresponding pipeline registers to the in_ports; the same holds for out_ports and m_pipeline_reg[OC_EX_*].

add_cu_set and add_port

void shader_core_ctx::create_exec_pipeline() {
  // op collector configuration
  enum { SP_CUS, DP_CUS, SFU_CUS, TENSOR_CORE_CUS, INT_CUS, MEM_CUS, GEN_CUS };

  opndcoll_rfu_t::port_vector_t in_ports;
  opndcoll_rfu_t::port_vector_t out_ports;
  opndcoll_rfu_t::uint_vector_t cu_sets;

  // configure generic collectors
  m_operand_collector.add_cu_set(
      GEN_CUS, m_config->gpgpu_operand_collector_num_units_gen,
      m_config->gpgpu_operand_collector_num_out_ports_gen);

  for (unsigned i = 0; i < m_config->gpgpu_operand_collector_num_in_ports_gen;
       i++) {
    in_ports.push_back(&m_pipeline_reg[ID_OC_SP]);
    in_ports.push_back(&m_pipeline_reg[ID_OC_SFU]);
    in_ports.push_back(&m_pipeline_reg[ID_OC_MEM]);
    out_ports.push_back(&m_pipeline_reg[OC_EX_SP]);
    out_ports.push_back(&m_pipeline_reg[OC_EX_SFU]);
    out_ports.push_back(&m_pipeline_reg[OC_EX_MEM]);
    if (m_config->gpgpu_tensor_core_avail) {
      in_ports.push_back(&m_pipeline_reg[ID_OC_TENSOR_CORE]);
      out_ports.push_back(&m_pipeline_reg[OC_EX_TENSOR_CORE]);
    }
    if (m_config->gpgpu_num_dp_units > 0) {
      in_ports.push_back(&m_pipeline_reg[ID_OC_DP]);
      out_ports.push_back(&m_pipeline_reg[OC_EX_DP]);
    }
    if (m_config->gpgpu_num_int_units > 0) {
      in_ports.push_back(&m_pipeline_reg[ID_OC_INT]);
      out_ports.push_back(&m_pipeline_reg[OC_EX_INT]);
    }
    if (m_config->m_specialized_unit.size() > 0) {
      for (unsigned j = 0; j < m_config->m_specialized_unit.size(); ++j) {
        in_ports.push_back(
            &m_pipeline_reg[m_config->m_specialized_unit[j].ID_OC_SPEC_ID]);
        out_ports.push_back(
            &m_pipeline_reg[m_config->m_specialized_unit[j].OC_EX_SPEC_ID]);
      }
    }
    cu_sets.push_back((unsigned)GEN_CUS);
    m_operand_collector.add_port(in_ports, out_ports, cu_sets);
    in_ports.clear(), out_ports.clear(), cu_sets.clear();
  }
  m_operand_collector.init(m_config->gpgpu_num_reg_banks, this);
}

m_dispatch_port: dispatches an issued instruction (one the warp scheduler has issued to the operand collector) to its corresponding datapath (dispatch unit); e.g., the tensor core and FP32 share one datapath, while FP16 and INT8 share another.
m_issue_port: issues the prepared source operands from the source register file banks to m_fu (the function unit) in the execute stage, inside the execute() function, where it is also checked whether each source register is ready.

void shader_core_ctx::execute() {
  for (unsigned n = 0; n < m_num_function_units; n++) {
    // each function unit reads its ready instruction from its OC_EX register
    unsigned issue_port = m_issue_port[n];
    register_set &issue_inst = m_pipeline_reg[issue_port];
    // ... the instruction is then issued into m_fu[n] if the unit can accept it
  }
}

initialize gpu pipeline

void shader_core_ctx::create_exec_pipeline() {
  for (unsigned k = 0; k < m_config->gpgpu_num_tensor_core_units; k++) {
    m_fu.push_back(new tensor_core(&m_pipeline_reg[EX_WB], m_config, this, k));
    m_dispatch_port.push_back(ID_OC_TENSOR_CORE);
    m_issue_port.push_back(OC_EX_TENSOR_CORE);
  }

  for (unsigned j = 0; j < m_config->m_specialized_unit.size(); j++) {
    for (unsigned k = 0; k < m_config->m_specialized_unit[j].num_units; k++) {
      m_fu.push_back(new specialized_unit(
          &m_pipeline_reg[EX_WB], m_config, this, SPEC_UNIT_START_ID + j,
          m_config->m_specialized_unit[j].name,
          m_config->m_specialized_unit[j].latency, k));
      m_dispatch_port.push_back(m_config->m_specialized_unit[j].ID_OC_SPEC_ID);
      m_issue_port.push_back(m_config->m_specialized_unit[j].OC_EX_SPEC_ID);
    }
  }
}
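If the omma datapath is modeled as its own pipelined unit, one possible registration (an assumption for this write-up, not GPGPU-Sim code or the paper's artifact) is to mirror the tensor_core loop above and reuse the tensor core operand-collector ports, so no new ID_OC/OC_EX pipeline registers are needed; dstc_unit and gpgpu_num_dstc_units below are hypothetical names.

// Hypothetical sketch inside create_exec_pipeline(): dstc_unit and
// gpgpu_num_dstc_units do not exist in GPGPU-Sim; dstc_unit stands in for a
// pipelined_simd_unit subclass whose per-instruction latency could come from
// something like the omma_dynamic_latency sketch above.
for (unsigned k = 0; k < m_config->gpgpu_num_dstc_units; k++) {
  m_fu.push_back(new dstc_unit(&m_pipeline_reg[EX_WB], m_config, this, k));
  m_dispatch_port.push_back(ID_OC_TENSOR_CORE);  // share the tensor core dispatch path
  m_issue_port.push_back(OC_EX_TENSOR_CORE);     // share the tensor core issue port
}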

scheduler_unit issues the instruction (same as for the other function units)

void scheduler_unit::cycle() {
  bool tensor_core_pipe_avail =
      (m_shader->m_config->gpgpu_num_tensor_core_units > 0) &&
      m_tensor_core_out->has_free(m_shader->m_config->sub_core_model, m_id);

  if (tensor_core_pipe_avail) {
    m_shader->issue_warp(*m_tensor_core_out, pI, active_mask, warp_id, m_id);
    issued++;
    issued_inst = true;
    warp_inst_issued = true;
    previous_issued_inst_exec_type = exec_unit_type_t::TENSOR;
  }
}

Next, let's analyze the whole pipeline:

void shader_core_ctx::cycle() {
  if (!isactive() && get_not_completed() == 0) return;

  m_stats->shader_cycles[m_sid]++;
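  // stages are stepped in reverse pipeline order so that each stage consumes
  // what the downstream stage produced in the previous cycle (modeling the
  // pipeline registers)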
  writeback();
  execute();
  read_operands();
  issue();
  for (unsigned int i = 0; i < m_config->inst_fetch_throughput; ++i) {
    decode();
    fetch();
  }
}

Benchmark Result