# Optimized Test compression bandwidth management for Ultra-large-Scale System-on-Chip Architectures performing Scan Test Bandwidth Management

Vengala Abhilash (M.Tech), Department of ECE (VLSI& Es)<sup>1</sup> J. Pushparani M.Tech., Assistant Professor<sup>2</sup>

Holymary Institute of Technology, Ranga Reddy District, Hyderabad, TS 501301-INDIA

abhilashvengala333@gmail.com<sup>1</sup> pushparani.jelli@gmail.com<sup>2</sup>

#### Abstract

In today's increasingly complex and interconnected world, system-on-a-chip (SoC) performance requirements are influenced by existing as well as evolving and emerging applications. With Moore's law supplying billions of transistors, and uni-processor architectures delivering diminishing performance gains, multicore chips are emerging as the prevailing architecture in both general-purpose and application-specific markets. As the core count increases, the need for a scalable on-chip communication fabric that can deliver high bandwidth is gaining in importance, leading to recent multicore chips interconnected with sophisticated on-chip networks. This paper introduces several test logic architectures that facilitate preemptive test scheduling for SoC circuits with embedded deterministic test-based test data compression. The same solutions allow efficient handling of physical constraints in realistic applications. A detailed experimental analysis is carried out on different provisions, architectures, and test-related factors to demonstrate the efficiency of the proposed method over state-of-the-art methods.

**Keywords:** Scan-based test, test access mechanism (TAM), test application time, test compression, test scheduling, bandwidth management, embedded deterministic test (EDT)

### **1. INTRODUCTION**

Intensive technological progress in semiconductor fabrication has shrunk chip features below 50 nanometers and moved the industry toward three-dimensional integrated circuits. As a consequence, gate counts and circuit operating frequencies have grown at an unparalleled rate. This keeps Moore's law relevant, with the transistor count in a typical design doubling every two years.

The growth in circuit size has forced design projects to be divided into independent functional parts. This approach has had a profound impact on the design process and forms the basis for producing modular system-on-a-chip (SoC) and system-in-a-package (SiP) circuits. These designs can include a variety of digital, analog, mixed-signal, memory, optical, micro-electro-mechanical, and radio-frequency cores. Most often, individual cores are delivered by various vendors as licensed IP cores comprising reusable units of logic, cell, or chip layout design.

The popularity of SoC circuits has led to an unprecedented increase in the cost of test, which has become a serious challenge. This rise is primarily attributed to the difficulty in accessing embedded cores during test, long test development and test application times, and large volumes of test data. New materials and new fabrication processes reduce the cost of a single transistor but at the same time introduce new types of manufacturing defects and change the distribution of traditional failures. Increases in test data volume drive up the cost of testing by elevating both test application time and tester storage. These factors determine the cost-effective density of transistors on a chip. This is why the electronics industry expects new test solutions that achieve an acceptable ratio of test cost to production cost while maintaining high test quality.

Implementing a hierarchical DFT methodology for designs with a large number of cores poses significant challenges. First, the number of chip-level pins is limited and does not suffice to drive all cores in parallel. Given the pin limitations, it is impossible to determine the optimal allocation of pins to cores for the best compression. Furthermore, since a particular core can be reused in multiple designs, a number of channel pins that is optimal for this core when embedded in one design may invalidate test reuse in other designs. Under such circumstances, the chip integrators collect data for all individual cores, examine the data along with all constraints for the design, and then manually determine test schedules. This may result in suboptimal test data volume and compromised test application time, especially because some outlier blocks have large pattern counts (PCs). Bandwidth management mitigates the dependence of core channels on the number of available chip-level pins, allows automatic scheduling of tests in a way that is transparent to the users, and significantly improves test planning at the core level. It also arbitrates the sharing of the chip-level channel pins, thereby guaranteeing the best data volume and test time reductions for the overall design. In this paper, we present a bandwidth management scheme for hierarchical designs that lets a designer trade off fixed and flexible channel allocations per core, as well as physical constraints, to minimize the routing overhead of the TAM-based networks. Furthermore, several techniques to deliver the control data during test are examined, together with a new scheduling algorithm that allows changing the input and output channel allocations when switching the channel configurations.

### **2. SYSTEM-ON-A-CHIP TESTING**

With on-chip test compression becoming the production test standard, its application in SoC designs requires additional infrastructure to transport test data between the SoC pins and the embedded cores. The industry is currently witnessing a major change in how data is transferred between different parts of an electronic system. With data rates exceeding 1 Gb/s, parallel I/O schemes are being replaced by high-speed serial links. This is driven by a need to meet new bandwidth requirements and simplify designs. However, as growth in high-speed I/O implies fewer digital pins, it becomes imperative to run SoC tests in a reduced pin count test environment. Moreover, cost-effective SoC test requires scheduling. Unfortunately, even the simplest existing test scheduling algorithms are time consuming. Indeed, many solutions are milestones in SoC testing. To show their variety, previous work on test data transportation, test data scheduling, and optimization techniques directly related to this work is presented here.

#### Testing in scan-based designs

Long-standing research in very-large-scale integration (VLSI) design prototyping has produced a set of rules that, when followed, guarantee the expected test quality. The main paradigms of the design for testability (DFT) methodology assume that a circuit under test (CUT) will be controllable, observable, and predictable. Controllability quantifies the ability to set every internal state of a circuit using only the circuit's externally available inputs. This is complemented by observability, which allows one to determine every state in a circuit by ensuring propagation of the signal to the circuit's outputs. Finally, predictability denotes the ability to determine the state of a circuit in response to given input stimuli. All these properties are exploited by an automatic test pattern generator (ATPG) to deliberately control signal propagation in the design structure and to generate test stimuli that assure testability (controllability and observability of the targeted faults).

### **3. MULTICORE, MULTILAYER SOC ARCHITECTURE**

SoC is a concept that has been around for a long time; the basic approach is to integrate more and more functionality into a given device. This integration can take the form of either hardware or solution software. Performance gains are traditionally achieved by increased clock rates and more advanced process nodes. Many SoC designs pair a DSP with a RISC processor to target specific applications. A more recent approach to increasing performance has been to create multicore devices. In this scenario, it's important to manage the competition for processing resources so that the full entitlement of the device can be realized. TI's new multicore, multilayer SoC architecture addresses these challenges and creates the first true network-on-chip infrastructure to unleash full multicore entitlement.

#### **Advancing Moore's Law**

The move to more advanced process nodes has been a key driver in keeping up with Moore's Law. In the case of Texas Instruments (TI)'s new family of devices, a move to 40 nanometers provides an impressive performance boost, but today's applications require more. TI's new SoC architecture provides the flexibility to include targeted coprocessing, fixed- and floating-point operation, optimized inter-element communication, and a variety of processor types (DSP, VSP, ARM®, etc.). TI's architecture incorporates DSP cores capable of both 32 GMACs per core for fixed-point operations and 16 GFLOPS for floating-point operations. This represents a performance boost that far exceeds the expectation of Moore's Law in a single generation and also brings to market the first floating-point processor capable of operating at the highest DSP performance levels. Figure 3.1 illustrates the new architecture.



Fig.3.1: Multicore, multilayer architecture

TI has designed a comprehensive connectivity plane, TeraNet 2, which provides throughput of 2 terabits per second, to address the need to seamlessly interconnect the various processing elements. TI's Multicore Shared Memory Controller provides direct access to the on-chip shared memory system and external DDR3 memory without robbing internal switch-fabric bandwidth, while TI's Multicore Navigator facilitates and manages communications across the SoC architecture through more than eight thousand elements. HyperLink 50 allows the interconnection of companion devices such as additional coprocessors or companion TI SoCs.

### **4. EXISTING METHOD**

Today's multicore chip architectures require nontrivial test solutions, a need imposed by the relentless miniaturization of semiconductor devices, which have become much faster and less power hungry than their predecessors. This trend has given rise to the growing popularity of system-on-chip (SoC) designs because of their ability to encapsulate many disparate types of complex IP cores running at different clock rates with different power requirements and multiple power supply voltage levels. Many SoC-based test schemes proposed so far utilize dedicated instrumentation, including test access mechanisms (TAMs) and test wrappers. TAMs are typically used to transfer test data between the SoC pins and embedded cores, whereas test wrappers form the interface between the core and the SoC environment. Solutions involving both TAMs and wrappers accomplish such tasks as optimizing the test interface architecture or control logic while addressing routing and layout constraints or the hierarchy of cores, scheduling test procedures, and minimizing power consumption. Several proposed techniques attempt to minimize SoC test time. One integrated scheme reduces the test time by optimizing dedicated TAMs and pin-count-aware test scheduling. Packet-switched networks-on-chip can replace dedicated TAMs in SoC testing by delivering test data through an on-chip communication infrastructure.

### **5. PROPOSED METHOD**

#### Control data delivery

The approach summarized in the previous section makes no specific provision for how the control data are delivered to the SoC test logic in order to set up test configurations. However, the number of test configurations, and hence the amount of control data that must be transferred between the ATE and the address registers of the interconnection network, may visibly affect test scheduling and the resulting test time. Consequently, three alternative schemes for uploading the control bits are proposed, and it is shown how each determines the final SoC test logic architecture. Since the scan routing paths from the chip-level test pins to the core-level test pins are dynamically selected by patterns, this interconnection network is also referred to as a dynamic scan router (DSR).

#### **A.** The use of IJTAG

The IEEE P1687 working group is developing a standard for accessing on-chip test and debug features via the IEEE 1149.1 test access port (TAP). The purpose of this Internal JTAG (IJTAG) standard is to automate the way one can manage on-chip instruments, and to describe a language for communicating with them via the IEEE 1149.1 test data registers (TDR). If there is an IJTAG network available on the SoC, and the total number of test configurations is relatively small, one can use it to deliver the control data, as shown in Figure 5.1.

The SoC design of Figure 5.1 has a single TAP and three different blocks: two cores (C1 and C2) under test, and the DSR interfacing ATE with C1 and C2. TAP can be instructed to enable a test path via the P1687 segment insertion bits (SIB). Every SIB is used to either enable or disable the inclusion of an instrument into the TDI/TDO path. The TDR in C1 or C2 can be either bypassed or loaded with data putting both cores into specific test modes. The TDR in DSR receives the control data indicating which core and which of its test channels are connected to which ATE channels.

The advantage of using the IJTAG network to deliver the DSR control data is a simple and easy-to-implement flow, as the network is frequently used to set the cores' TDRs. However, such an approach can support only a limited number of configurations. This is because the IJTAG shift clock is typically 10 to 20 times slower than the scan shift clock. Delivering a large volume of control data can incur an unacceptable test time overhead and hence an increase in the total test time. In practice, this architecture is justified only for a very small number of test configurations.
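The overhead argument can be made concrete with a back-of-the-envelope estimate. The sketch below compares the time needed to shift the same control data at the scan shift rate and at an IJTAG rate that is 20 times slower. The function name, the 50 MHz scan clock, and the bit and configuration counts are illustrative assumptions, not values taken from the experiments; only the 10x to 20x slowdown comes from the text.

```python
def control_upload_time_us(bits_per_config, num_configs, shift_clock_mhz):
    """Return the time, in microseconds, needed to shift all control bits serially."""
    total_bits = bits_per_config * num_configs
    # One bit per shift cycle; cycles divided by MHz gives microseconds.
    return total_bits / shift_clock_mhz


# Assumed example values (hypothetical, not measured data).
SCAN_SHIFT_MHZ = 50.0
IJTAG_SLOWDOWN = 20          # upper end of the 10x to 20x range quoted above
BITS_PER_CONFIG = 100
NUM_CONFIGS = 10

scan_rate_us = control_upload_time_us(BITS_PER_CONFIG, NUM_CONFIGS, SCAN_SHIFT_MHZ)
ijtag_us = control_upload_time_us(
    BITS_PER_CONFIG, NUM_CONFIGS, SCAN_SHIFT_MHZ / IJTAG_SLOWDOWN
)
print(f"scan-rate chain: {scan_rate_us:.1f} us, IJTAG: {ijtag_us:.1f} us")
```

Under these assumptions the IJTAG upload takes 20 times longer, and the gap grows linearly with the number of configurations, which is why this option suits only a handful of configurations.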

Fig.5.1: Using IJTAG network to transfer control data



Fig.5.2: Dedicated control chain-based architecture

#### **B.** Dedicated control chain

The SoC design of Figure 5.2 uses two dedicated control chains to deliver the control data. In principle, these structures are obtained by daisy-chaining the address registers of both DSRs. Let the design have n input test pins and m output test pins at the chip level. One can insert n control chains driven by the n input test pins through n de-multiplexers. The selector pin of each de-multiplexer is controlled by a signal CF. It is worth noting that if $n \ge m$, then n control chains suffice to supervise both the input and output sides. If $n < m$, one chain can be used to control one input channel and multiple output channels, as the number of control bits per channel is typically much smaller than the number of pattern shift cycles.

When CF is set to 0, a test pattern is used only to deliver the control data for the de-multiplexers and multiplexers of the input and output DSR, respectively. This particular pattern does not have to be observed, and hence it features only an upload phase. However, the number of shift cycles for control patterns must match the number of shift cycles of conventional tests, because the ATE does not distinguish between control patterns and regular test vectors. If CF is equal to 1, patterns are applied as regular test vectors. The DSR, which has already been configured, then connects the respective ATE channels with the core test channels as indicated by the content of the address registers.

As can be seen in Figure 5.2, CF is obtained by XOR-ing the content of two flip-flops: A and Bs. Flip-flop A is the last cell of a control chain, while flip-flop Bs is the shadow of B, a pipelining stage in one of the scan input channels. Since A and Bs are initialized to 0, CF is 0 as well, and the first test pattern must therefore be a control one. Once the control data are shifted in, flip-flop A becomes asserted, and so does CF. As a result, the regular scan test patterns can be applied. Given a test configuration, all patterns but the last one set B to 0, whereas Bs is updated at the end of each test pattern upload. The last scan pattern of a test configuration sets B and subsequently Bs to 1, then uses the resulting de-asserted CF to upload the new control data into the control chain and to set up a new test configuration. The remaining configurations and test patterns are applied in a similar manner by toggling the values of A and Bs. The advantage of this architecture is its ability to support as many test configurations as desired, since the control chain shift frequency is the same as that of conventional scan chains.

Figure 5.3: Pipeline architecture
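The CF handshake of the dedicated control chain can be replayed with a small behavioral model. This is a sketch under stated assumptions rather than the implemented RTL: the function name `simulate_cf` and the trace format are hypothetical, each configuration is assumed to contain at least one scan pattern, and the final control bit is assumed to load A with the complement of Bs so that CF becomes 1.

```python
def simulate_cf(patterns_per_config):
    """Return (pattern_kind, CF) pairs, where CF = A XOR Bs during that pattern."""
    a, bs = 0, 0                      # both flip-flops are initialized to 0
    trace = []
    for count in patterns_per_config:
        if count < 1:
            raise ValueError("each configuration needs at least one scan pattern")
        # CF is 0 here, so this pattern only uploads control data.
        trace.append(("control", a ^ bs))
        # Assumption: the last control bit leaves A different from Bs (CF = 1).
        a = 1 - bs
        for index in range(count):
            trace.append(("scan", a ^ bs))
            if index == count - 1:
                # The last scan pattern loads B so that Bs toggles at the
                # end of the upload, which de-asserts CF again.
                bs = 1 - bs
    return trace


for kind, cf in simulate_cf([2, 1]):
    print(kind, cf)
```

In this model every control pattern sees CF = 0 and every scan pattern sees CF = 1, while A and Bs alternate between 0 and 1 from one configuration to the next, matching the toggling described above.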

#### **C.** Pipeline architecture

One can also use the regular scan channels to deliver controls through pipelining stages, as shown in Figure 5.3. For each channel, this approach concatenates n + m control bits, where n and m are the numbers of control bits used by the input demultiplexers and the output multiplexers, respectively. Moreover, each control bit is shadowed to avoid distorting test configurations in the middle of test data shifting. The shadow registers are updated at the end of each pattern upload. Thus, when test pattern p launches a new test configuration, the corresponding control data need to be loaded earlier with pattern p - 1. Consequently, the first vector is exclusively a setup one. The architecture of Figure 5.3 also supports as many test configurations as required. However, the control data are always uploaded through the ATE channels as an integral part of a test vector. Hence, given a test configuration, the same control data are repeated for all test patterns. The amount of control data is small, though, as the number of control bits per channel is typically a tiny fraction of the test pattern shift cycles.
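The one-pattern lead required by the shadow registers can be illustrated with a short model. It is a sketch only: the function names, the tuple layout, and the string labels for configurations are hypothetical, and the control field carried by the very last pattern is assumed to simply repeat the current configuration.

```python
def schedule_pipeline_uploads(configs):
    """configs[i] is the configuration wanted while test pattern i is shifted.

    Returns a list of (pattern_index_or_None, control_data) vectors, where
    None marks the setup-only first vector.
    """
    if not configs:
        return []
    uploads = [(None, configs[0])]    # first vector carries control data only
    for i in range(len(configs)):
        # Pattern i carries the control data needed by pattern i + 1.
        # Assumption: the last pattern repeats its own configuration.
        nxt = configs[i + 1] if i + 1 < len(configs) else configs[i]
        uploads.append((i, nxt))
    return uploads


def active_configs(uploads):
    """Replay the uploads; the shadow register updates at the end of each one."""
    shadow = None
    seen = []
    for pattern_index, control in uploads:
        if pattern_index is not None:
            seen.append(shadow)       # configuration in effect for this pattern
        shadow = control              # update at the end of the pattern upload
    return seen


wanted = ["cfg0", "cfg0", "cfg1"]
uploads = schedule_pipeline_uploads(wanted)
print(uploads)
print(active_configs(uploads) == wanted)
```

The replay confirms that loading the control data one pattern early yields the wanted configuration for every test pattern, at the cost of repeating the same control bits within a configuration.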

### **6. RESULTS**

#### SoC test environment with on-chip compression

Simulation Results:



#### Synthesis Results:

RTL SCHEMATIC



**Technology Schematic:** 

## DOI: 10.18535/ijecs/v5i10.36



## **Design Summary:**

Device Utilization Summary (estimated values)

| Logic Utilization          | Used | Available | Utilization |
|----------------------------|------|-----------|-------------|
| Number of Slices           | 11   | 4656      | 0%          |
| Number of Slice Flip Flops | 7    | 9312      | 0%          |
| Number of 4 input LUTs     | 20   | 9312      | 0%          |
| Number of bonded IOBs      | 23   | 232       | 9%          |
| Number of GCLKs            | 1    | 24        | 4%          |

## Timing Report:

Data Path: os1<0> to oc1

| Cell:in->out | Fanout | Gate Delay (ns) | Net Delay (ns) | Logical Name (Net Name)   |
|--------------|--------|-----------------|----------------|---------------------------|
| IBUF:I->O    | 2      | 1.106           | 0.532          | os1_0_IBUF (os1_0_IBUF)   |
| LUT3:I0->O   | 1      | 0.612           | 0.000          | m4/Mmux_y_3 (m4/Mmux_y_3) |
| MUXF5:I1->O  | 1      | 0.278           | 0.357          | m4/Mmux_y_2_f5 (oc1_OBUF) |
| OBUF:I->O    |        | 3.169           |                | oc1_OBUF (oc1)            |

Total: 6.054 ns (5.165 ns logic, 0.889 ns route; 85.3% logic, 14.7% route)

#### Dedicated control chain-based architecture

Design summary:

Device Utilization Summary (estimated values)

| Logic Utilization          | Used | Available | Utilization |
|----------------------------|------|-----------|-------------|
| Number of Slices           | 7    | 4656      | 0%          |
| Number of Slice Flip Flops | 11   | 9312      | 0%          |
| Number of 4 input LUTs     | 10   | 9312      | 0%          |
| Number of bonded IOBs      | 6    | 232       | 2%          |
| Number of GCLKs            | 1    | 24        | 4%          |

Timing report:

Data Path: m12/out to oc1

| Cell:in->out | Fanout | Gate Delay (ns) | Net Delay (ns) | Logical Name (Net Name)     |
|--------------|--------|-----------------|----------------|-----------------------------|
| FDR:C->Q     | 2      | 0.514           | 0.532          | m12/out (m12/out)           |
| LUT3:I0->O   | 1      | 0.612           | 0.000          | m15/Mmux_y_3 (m15/Mmux_y_3) |
| MUXF5:I1->O  | 1      | 0.278           | 0.357          | m15/Mmux_y_2_f5 (oc1_OBUF)  |
| OBUF:I->O    |        | 3.169           |                | oc1_OBUF (oc1)              |

Total: 5.462 ns (4.573 ns logic, 0.889 ns route; 83.7% logic, 16.3% route)

## **RTL SCHEMATIC:**



## **Technology schematic:**



Simulation results:

[Simulation waveform: signals oc1, oc2, clk, rst, ic1, ic2; time axis 0 ns to 800 ns, cursor at 550.000 ns]

#### Pipeline architecture

Design summary:

Device Utilization Summary (estimated values)

| Logic Utilization          | Used | Available | Utilization |
|----------------------------|------|-----------|-------------|
| Number of Slices           | 6    | 4656      | 0%          |
| Number of Slice Flip Flops | 9    | 9312      | 0%          |
| Number of 4 input LUTs     | 6    | 9312      | 0%          |
| Number of bonded IOBs      | 5    | 232       | 2%          |
| Number of GCLKs            | 1    | 24        | 4%          |

### **Timing report:**

Data Path: shadow_m12/out to oc1

| Cell:in->out | Fanout | Gate Delay (ns) | Net Delay (ns) | Logical Name (Net Name)   |
|--------------|--------|-----------------|----------------|---------------------------|
| FDR:C->Q     | 2      | 0.514           | 0.532          | shadow_m12/out (shadow_…  |
| LUT3:I0->O   | 1      | 0.612           | 0.000          | m15/Mmux_y_3 (m15/Mmux…   |
| MUXF5:I1->O  | 1      | 0.278           | 0.357          | m15/Mmux_y_2_f5 (oc1_OBU… |
| OBUF:I->O    |        | 3.169           |                | oc1_OBUF (oc1)            |

Total: 5.462 ns (4.573 ns logic, 0.889 ns route; 83.7% logic, 16.3% route)

### **RTL SCHEMATIC:**



**Technology schematic:** 



Simulation results:



### **8. CONCLUSION**

The paper has proposed a comprehensive solution for SoC testing in a test compression environment. Bandwidth-aware test compression and test compaction solutions, as well as two different bandwidth management schemes, have been introduced. The concept of bandwidth-aware channel utilization has been further exploited. A channel selection order technique and bandwidth-aware compression to reduce test data volume are presented. Both solutions increase the uniformity of the variable distribution in the decompressor, which assures a high compression ratio independent of the EDT environment configuration. Assuming that all SoC cores are wrapped testable units, this paper studies several practical issues regarding SoC-based testing that deploys on-chip test data compression with the ability to dynamically use ATE channels. The proposed solutions include methods used to deliver control data and test scheduling algorithms that minimize the overall test application time. Experimental results obtained for a large industrial SoC design confirm the feasibility of the proposed schemes and their ability to trade off the number of test pins, the design complexity of the TAM, and the test application time.

#### REFERENCES

[1] K. Chakrabarty, "Test scheduling for core-based systems using mixed-integer linear programming," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 19, no. 10, pp. 1163–1174, Oct. 2000.

[2] K. Chakrabarty, V. Iyengar, and M. D. Krasniewski, "Test planning for modular testing of hierarchical SOCs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 3, pp. 435–448, Mar. 2005.

[3] A. Chandra and K. Chakrabarty, "A unified approach to reduce SOC test data volume, scan power and testing time," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 3, pp. 352–362, Mar. 2003.

[4] S. K. Goel and E. J. Marinissen, "Effective and efficient test architecture design for SOCs," in Proc. Int. Test Conf. (ITC), 2002, pp. 529–538.

[5] S. K. Goel, E. J. Marinissen, A. Sehgal, and K. Chakrabarty, "Testing of SoCs with hierarchical cores: Common fallacies, test access optimization, and test scheduling," IEEE Trans. Comput., vol. 58, no. 3, pp. 409–423, Mar. 2009.

[6] P. T. Gonciari and B. M. Al-Hashimi, "A compression-driven test access mechanism design approach," in Proc. 9th IEEE Eur. Test Symp. (ETS), May 2004, pp. 100–105.

[7] Y. Huang et al., "Optimal core wrapper width selection and SOC test scheduling based on 3-D bin packing algorithm," in Proc. Int. Test Conf. (ITC), 2002, pp. 74–82.

[8] V. Iyengar and K. Chakrabarty, "System-on-a-chip test scheduling with precedence relationships, preemption, and power constraints," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 9, pp. 1088–1094, Sep. 2002.

[9] V. Iyengar, K. Chakrabarty, and E. J. Marinissen, "Test wrapper and test access mechanism co-optimization for system-on-chip," J. Electron. Test., Theory Appl., vol. 18, pp. 213–230, Apr. 2002.

[10] V. Iyengar, K. Chakrabarty, and E. J. Marinissen, "Efficient test access mechanism optimization for system-on-chip," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 5, pp. 635–643, May 2003.