High Performance Computer Architecture

Hong Kong University 2022 summer semester class review: ELEC6036 High Performance Computer Architecture.

Pipelining

Pipelining Lessons

  • Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

  • Pipelining is the key implementation technique currently used to make high performance CPUs.

  • Pipelining doesn’t lower the latency of a single task; it increases the throughput (overall system performance) of the entire workload.

  • Pipeline rate is limited by the slowest (longest) pipeline stage.

  • The principle: multiple tasks operate simultaneously using different resources.

  • Potential speedup = number of pipe stages.

  • Unbalanced lengths of pipe stages reduce speedup: the key is that every stage should have the same duration.

  • Time to “fill” the pipeline and time to “drain” it reduces speedup.

  • Stall for dependences (stall = delay; i.e., the pipelined computer is delayed whenever there are control/data dependences between instructions).

The Five Stages of Load

  • Ifetch: Instruction Fetch, fetch the instruction from the instruction memory.
  • Reg/Dec: Register Fetch and Instruction Decode, read the registers from the register file’s read ports.
  • Exec: calculate the memory address with the ALU.
  • Mem: read the data from the data memory.
  • Wr: write the data back to the register file through the register file’s write port.

As the instruction execution cycle uses different hardware components for different steps, it is possible to set up a pipeline.

Suppose each stage takes 1 nsec (nanosecond). Then each instruction still takes 5 nsec, but once the pipeline has been filled, a complete instruction rolls out every 1 nsec. Thus, the speedup is 5.

Pipelining in the CPU

(instruction-level parallelism)

  • Pipelining is an implementation technique that exploits parallelism among instructions in a sequential instruction stream.
  • A major advantage of pipelining over “parallel processing” is that it is not visible to the programmer (whereas in parallel processing, the program usually needs to specify what kinds of tasks to be executed in parallel).
  • In a computer system, each pipeline stage completes a part of the instruction being executed.
  • The time required to move an instruction one step down the pipeline is a machine cycle (the clock cycle). The length of a machine cycle is determined by the time required for the slowest stage.
  • The computer engineer should try to balance the length of each pipeline stage so as to achieve the ideal speedup. In practice the pipeline stages will not be perfectly balanced, and there are additional overheads, but we can get close to the ideal case (i.e., CPI = 1, where CPI stands for Cycles Per Instruction).

Design Issues

  • We have to make sure that the same resource (e.g., ALU) is not used in more than one pipeline stage.
  • If the resources used in different pipeline stages are different, then overlapping is possible.
  • However, we must note that to retain the intermediate values produced by an individual instruction for all its pipeline stages, we must include temporary registers between the pipeline stages.

($1, $2, and $3 represent registers #1, #2, and #3.)

  • Pipelining increases the processor instruction throughput — the number of instructions completed per unit time.
  • Remember: pipelining does not reduce the execution time of a single instruction.
  • The increase in instruction throughput means that the program runs faster, even though no single instruction runs faster.
  • Imbalance among the pipeline stages reduces performance since the clock cannot run faster than the time needed for the slowest pipeline stage.
  • Pipelining overhead can arise from the combination of pipeline register delay (propagation delay of the register to transfer the values between the pipeline stages) and other factors.

Graphically Representing Pipelines

Conventional Pipelined Execution Representation
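Since no figure survives for this section, here is a minimal Python sketch that prints the conventional “staircase” representation: one row per instruction, one column per clock cycle, with each instruction entering the pipeline one cycle after its predecessor. The stage names reuse the five stages of load above; the instruction count is an arbitrary choice.

```python
# Minimal sketch: print the conventional staircase diagram of a 5-stage
# pipeline (one row per instruction, one column per clock cycle).
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def draw_pipeline(n_instructions: int) -> None:
    total_cycles = n_instructions + len(STAGES) - 1
    print(" " * 8 + " ".join(f"C{c + 1:<7}" for c in range(total_cycles)))
    for i in range(n_instructions):
        row = [" " * 8] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage:<8}"  # instruction i is in stage s at cycle i+s
        print(f"I{i + 1:<7}" + " ".join(row))

draw_pipeline(4)
```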

Single Cycle, Multiple Cycle, vs. Pipeline

Suppose we execute 100 instructions:

  • Single Cycle Machine

    - 45 ns/cycle x 1 CPI x 100 instructions = 4500ns

  • Multi-Cycle Machine

    - 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 instructions = 4600ns

  • Ideal pipelined machine

    - 10 ns/cycle x (1 CPI x 100 instructions + 4 cycles drain) = 1040ns!!

Answer: pipelining is much faster; in the above example, the total time is around 1/4 of that required by the single-cycle machine!
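The arithmetic above is easy to check mechanically. A minimal Python sketch, using exactly the cycle times, CPIs, and the 4-cycle drain from the example:

```python
# Minimal sketch: reproduce the three machine timings above.
N = 100                       # instructions executed

single_cycle = 45 * 1 * N     # 45 ns/cycle x 1 CPI x 100 instructions
multi_cycle = 10 * 4.6 * N    # 10 ns/cycle x 4.6 CPI x 100 instructions
pipelined = 10 * (1 * N + 4)  # 1 CPI once full, plus 4 cycles to drain

print(single_cycle, multi_cycle, pipelined)          # 4500 4600.0 1040
print(f"speedup vs. single cycle: {single_cycle / pipelined:.2f}x")  # 4.33x
```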

Why Pipeline? Because We Can!

Pipeline Hazards

Structural Hazards:

attempt to use the same resource in two different ways (e.g., by two different instructions) at the same time.

- e.g., a combined washer/dryer would be a structural hazard, or the “folder” is busy doing something else (e.g., watching TV ;-)

Control Hazards:

attempt to make a decision before condition is evaluated

- e.g., washing football uniforms and need to get proper detergent level; need to see the output of the dryer before the next load can go in

- branch instructions

  • Control Hazard Solution #1: Stall

    Stall: wait until decision is clear

    Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow

    Moving the decision to the end of decode (i.e., earlier, into the decode stage) saves 1 cycle per branch.

  • Control Hazard Solution #2: Predict

    Predict: guess one direction then back up if wrong

    Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time)

    • Need to “Squash” and restart following instruction if wrong
    • Produces a CPI on branches of 1 * 0.5 + 2 * 0.5 = 1.5 (the CPI of the branch instructions)
    • Total CPI might then be: 1.5 * 0.2 + 1 * 0.8 = 1.1 (with 20% branches)

    More dynamic scheme: keep a history per branch (~90% accurate); e.g., if 30% of instructions are branches, then total CPI = 1.5 * 0.3 + 1 * 0.7 = 1.15 (see the CPI sketch after this list).

    Delayed Branch: Redefine branch behavior (takes place after next instruction)

    Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the “slot” (~50% of the time)

    As we launch more instructions per clock cycle, this becomes less useful.
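The branch-CPI bookkeeping above can be captured in a few lines. A minimal sketch; the 1-cycle misprediction penalty and the accuracy/branch-mix figures are the ones assumed in the text:

```python
# Minimal sketch: CPI of the predict-and-squash scheme described above.
def branch_cpi(accuracy: float, penalty: int = 1) -> float:
    # 1 cycle if predicted right, 1 + penalty cycles if wrong.
    return 1 * accuracy + (1 + penalty) * (1 - accuracy)

def total_cpi(branch_frac: float, accuracy: float) -> float:
    # Blend the branch CPI with the ideal CPI of 1 for everything else.
    return branch_cpi(accuracy) * branch_frac + 1.0 * (1 - branch_frac)

print(branch_cpi(0.5))       # 1.5  (right half the time)
print(total_cpi(0.2, 0.5))   # 1.1  (20% branches)
print(total_cpi(0.3, 0.5))   # 1.15 (the 30%-branch example)
print(total_cpi(0.2, 0.9))   # ~1.02 with a 90%-accurate dynamic predictor
```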

Data Hazards:

attempt to use item before it is ready (an instruction depends on the result of a previous instruction still in the pipeline)

- e.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer

- instruction depends on result of prior instruction still in the pipeline

Data Hazard on r1: Read After Write (RAW)

Data Hazard Solution: Forwarding

Forwarding (or Bypassing): What About Loads?


These hazards can always be resolved by waiting (a pipeline stall):

- pipeline control must detect the hazard

- take action (or delay action) to resolve the hazard (see the detection sketch below)
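A minimal sketch of the detection step. The three-tuple instruction encoding (op, destination, sources) and the register names are invented for illustration; the load-use rule matches the “What About Loads?” point above:

```python
# Minimal sketch: find RAW hazards and decide forward vs. stall.
# The (op, dest, srcs) tuples are an illustrative encoding, not a real ISA.
program = [
    ("lw",  "r1", ["r2"]),        # r1 <- MEM[r2]
    ("add", "r3", ["r1", "r4"]),  # uses r1 immediately after the load
    ("sub", "r5", ["r1", "r3"]),
]

for i, (op, dest, _) in enumerate(program):
    for j in range(i + 1, min(i + 3, len(program))):   # look 2 insts ahead
        later_op, _, srcs = program[j]
        if dest in srcs:                               # RAW dependence
            if op == "lw" and j == i + 1:
                # Load data is only ready after MEM: forwarding alone
                # cannot help the very next instruction.
                print(f"{op}->{later_op}: RAW on {dest}, stall 1 cycle, then forward")
            else:
                print(f"{op}->{later_op}: RAW on {dest}, resolved by forwarding")
```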

Summary of Concepts

  • Reduce CPI by overlapping many instructions
    • average throughput of approximately 1 CPI with fast clock
  • Utilize capabilities of the datapath
    • start next instruction while working on the current one
    • limited by length of longest stage (plus fill/flush)
    • detect and resolve hazards
  • What makes it easy?
    • all instructions are of the same length (very important for pipeline design)
    • just a few instruction formats
    • memory operands appear only in loads (LW) and stores (SW)
  • What makes it hard?
    • structural hazards: suppose we had only one memory
    • control hazards: need to worry about branch instructions
    • data hazards: an instruction depends on a previous instruction (whenever two instructions have data dependence, there will be data hazards)

Issues in Pipelined Design

High Performance Techniques

  • How to Optimize the Pipeline? Extract More Parallelism! (pipelined computers with multiple execution stages, e.g., ADD using EX1 and EX2, etc.)
  • Compiler-Directed (Static) Approaches
    • VLIW (Very Long Instruction Word)
    • EPIC (Explicit Parallel Instruction Computer)
    • Superscalar
    • Software Pipelining
  • A Dynamic Approach—Scoreboard

Pipelining: Can we somehow make CPI closer to 1?

  • Let’s assume the FP units are fully pipelined:

  • If we have a 4-cycle instruction (an instruction requiring 4 execution cycles), then we need 3 instructions between a producing instruction and its use:

  • Getting CPI < 1: Processing Multiple Instructions per Cycle

    • Use parallel processing!!
    • Two main variations: superscalar and VLIW
    • Superscalar: varying number of instructions/cycle (1 to 6)
      • parallelism and dependencies determined/resolved by HW(hardware)
      • IBM PowerPC 604, Sun UltraSPARC, DEC Alpha 21164, HP 7100
    • Very Long Instruction Words (VLIW): fixed number of instructions (16) determined by compiler
      • pipeline is exposed; compiler must schedule delays to get right results
    • Explicit Parallel Instruction Computer (EPIC)/Intel
      • 128 bit packets containing 3 instructions (can execute sequentially)
      • can link 128 bit packets together to allow more parallelism
      • compiler determines parallelism, HW checks dependencies and forwards/stalls
  • Parallelism: Overt vs. Covert

Compilation and ISA (Instruction Set Architecture)

  • Efficient compilation requires knowledge of the pipeline structure

    • latency and bandwidth of each operation type
  • But a good ISA transcends several implementations with different pipelines

    • should things like a delayed branch be in an ISA?
    • should a compiler use the properties of one implementation when compiling for an ISA?
    • do we need a new interface?

Very Long Instruction Word (VLIW) Computers

Pros (Strengths)

  • Very simple hardware
    • no dependency detection
    • simple issue logic
    • just ALUs and register files
  • Potentially exploits large amounts of ILP

Cons (Weaknesses)

  • Lockstep execution (static schedule)
    • very sensitive to long-latency operations (cache misses)
  • Global register file hard to build
  • Lots of NO-OPs
    • poor “code density”
    • I-cache capacity and bandwidth compromised
  • Must recompile sources to deliver potential
  • Implementation visible through ISA

EPIC: Explicit Parallel Instruction Computer

128-bit instructions:

  • three 3-address operations

  • a template that encodes dependencies

  • 128 general registers

  • predication (conditional, predicated execution of instructions, which removes many branches)

  • speculative loads (load speculation: a load may be hoisted before it is known to be needed)

  • Example: IA-64 of Intel/HP

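As a small consistency check on the 128-bit packet: in Intel/HP IA-64, each bundle holds three 41-bit instruction slots plus a 5-bit template (the slot and template widths are IA-64 details not stated above):

```python
# Minimal sketch: bit budget of an IA-64 style 128-bit bundle.
SLOTS, SLOT_BITS, TEMPLATE_BITS = 3, 41, 5
assert SLOTS * SLOT_BITS + TEMPLATE_BITS == 128  # 3 x 41 + 5 = 128
print(SLOTS * SLOT_BITS + TEMPLATE_BITS, "bits per bundle")
```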

Superscalar

Superscalar DLX: 2 instructions, 1 FP & 1 anything else (int / LW / SW)

  • fetch 64 bits/clock cycle; integer on left, FP on right

  • can only issue 2nd instruction if 1st instruction issues

  • more ports for FP registers to do FP load & FP op in a pair

  • 1 cycle load delay expands to 3 instructions in SS (super-scalar)

    • instruction in right half can’t use it, nor instructions in next slot

Limits of Superscalar:

  • While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:

    • exactly 50% FP operations
    • no hazards
  • If more instructions issue at same time, greater difficulty of decode and issue

    • even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue

  • VLIW: tradeoff instruction space for simple decoding

    • the long instruction word has room for many operations
    • by definition, all the operations the compiler puts in the long instruction word can execute in parallel
    • e.g., 2 integer operations, 2 FP ops, 2 memory references, 1 branch; 16 to 24 bits for each of these fields ==> 7 x 16 or 112 bits to 7 x 24 or 168 bits wide
    • need compiling technique that schedules across several branches

Software Pipelining

  • Observation: if iterations from loops are independent, then can get more ILP (instruction level parallelism) by taking instructions from different iterations

  • Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (i.e., the Tomasulo algorithm in SW)

Example:
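A minimal Python sketch of the reorganization, standing in for the slide example; the LD/ADD/SD loop body and the iteration numbering are illustrative assumptions:

```python
# Minimal sketch: software-pipeline a loop whose body is LD -> ADD -> SD.
# In the steady-state kernel each pass mixes three different iterations,
# so no instruction waits on a result produced in the same pass.
body = ["LD", "ADD", "SD"]
n_iters = 5

# Original loop: every pass runs all three ops of one iteration.
original = [(op, it) for it in range(n_iters) for op in body]

# Software-pipelined kernel: pass k stores iteration k, adds iteration
# k+1, and loads iteration k+2 (after a 2-iteration prologue).
kernel = [[("SD", k), ("ADD", k + 1), ("LD", k + 2)]
          for k in range(n_iters - 2)]

print("original:", original[:6], "...")
for k, ops in enumerate(kernel):
    print(f"kernel pass {k}:", ops)
```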

Multiple Issue as compared with VLIW

Multiple Issue:

  • more complex issue logic
    • check dependencies
    • check structural hazards
    • issue variable number of instructions (0-N)
    • shift unissued instructions over
  • Able to run existing binaries
    • recompile for performance, not correctness
  • Data paths identical
    • but bypass requires detection
  • Neither VLIW nor multiple issue can schedule around run-time variation in instruction latency
    • cache misses
  • Dealing with run-time variation requires run-time or dynamic scheduling

The problem with Static Scheduling (Compile-Time)

In-Order Execution:

  • an unexpectedly long latency blocks ready instructions from executing (the scheduled code cannot be changed at run time)
  • binaries need to be rescheduled (recompiled) for each new processor implementation
  • the small number of named registers becomes a bottleneck.

Dynamic Scheduling:

  • The hardware rearranges the execution sequence of the code at run time to reduce pipeline stalls caused by hazards.

  • A dynamically scheduled pipeline provides the following capabilities:

    • It fetches and issues multiple instructions in order: the instruction fetch (IF) stage fetches instructions in program order into a single-entry latch or a queue, from which they are then taken into the ID stage.
    • It can detect and resolve hazards: the ID stage is split into two independent stages, Issue (ID1) and Read operands (ID2):
      • Issue stage: decode the instruction and check for structural hazards (i.e., structural hazards are resolved at this stage);
      • Read operands stage: wait until there is no data hazard (RAW), then read the operands; i.e., data hazards are resolved at this stage.

Scoreboard: a bookkeeping technique

  • Out-of-order execution divides ID stage:
    • Issue——decode instructions, check for structural hazards
    • Read operands——wait until no data hazards, then read operands
  • Scoreboards date to the CDC 6600, designed in 1963
  • Instructions execute whenever they do not depend on previous instructions and there are no hazards
  • CDC 6600: in-order issue, out-of-order execution, out-of-order commit (or completion) (out-of-order, i.e., dynamic, scheduling means execution does NOT have to follow the original order of the program code)
    • no forwarding
    • imprecise interrupt/exception model for now

Scoreboard Implications

Algorithm Design

  • Dynamic scheduling splits the ID (decode) stage into two phases: (1) Issue and (2) Read operands.
    • The Issue phase decodes the instruction and checks for structural hazards (e.g., given four functional units, an integer unit, an adder, a multiplier, and a divider, check whether the unit this instruction needs is currently occupied). The Read operands phase checks for data hazards (before reading, check whether the register value has already been written back, or whether a previous value would be overwritten).
    • Handling data hazards:
      • Read after write (RAW): buffer the instruction and its operand specifiers; the operands can only be read in the Read operands phase.
      • Write after write (WAW): detect whether another instruction will write back to the same register (a hazard is detected); if so, wait until the other instruction completes.
    • Issue phase: if no structural hazard and no WAW hazard is detected, the scoreboard issues the instruction to the corresponding functional unit; if a structural or WAW hazard occurs, the instruction waits until the hazard disappears.
    • Read operands: once there is no data dependence (no earlier instruction is going to write, or is currently writing, the source register, i.e., no RAW hazard), read the operands and hand them to the functional unit, which then starts computing. Instructions may be sent to the functional units out of order.
  • After that come the Execute and Write-back stages. The Execute stage notifies the scoreboard once a computation completes. Knowing the computation has completed, the scoreboard checks for WAR hazards (i.e., whether the register has been read before it is overwritten; for example, with ADD F10,F0,F8 followed by SUB F8,F8,F14, when the SUB writes back it must check whether the ADD has already read F8, and that is all). If a hazard is detected, the instruction waits; otherwise it can write the register.

Out-of-order completion ==> WAR, WAW hazards?

  • Solutions for WAR:
    • stall writeback until registers have been read
    • read registers only during Read Operands stage
  • No register renaming! (This differs from the Tomasulo computers discussed later)
  • Need to have multiple instructions in the execution phase ==> multiple execution units or pipelined execution units
  • Scoreboard keeps track of dependencies between instructions that have already issued
  • Scoreboard replaces ID, EX, WB with 4 stages (ID1(Issue), ID2(Read Operand), EX, WB(or completion))

Four Stages of Scoreboard Control

  • Issue—decode instructions & check for structural hazards (ID1)

    • instructions issued in program order (for hazard checking)
    • don’t issue if structural hazard
    • don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
  • Read operands—wait until no data hazards, then read operands (ID2)

    • all real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data
    • no forwarding of data in this model
  • Execution—operate on operands (EX)

    • the functional unit begins execution upon receiving operands; when the result is ready, it notifies the scoreboard that it has completed execution
    • note that different instructions will require different numbers of clock cycles for the execution stage!
  • Write result—finish execution (WB)

    • stall until there are no WAR hazards with previous instructions.

Three Parts of the Scoreboard

  • Instruction status:

    Which of 4 steps the instruction is in

  • Functional unit status:

    Indicates the state of the functional unit (FU). There are 9 fields for each functional unit: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, and Rk.

  • Register result status:

    Indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register

Scoreboard Example:
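The slide tables are not reproduced here; as a substitute, a minimal sketch of the three tables as Python dictionaries, seeded with a hypothetical two-instruction snippet (an FP load followed by a multiply that needs the loaded value). The nine FU-status fields follow the classic scoreboard layout:

```python
# Minimal sketch: the scoreboard's three tables for a hypothetical snippet
#   L.D   F6, 34(R2)     (Integer unit, about to write F6)
#   MUL.D F0, F2, F6     (Mult1 unit, waiting on F6 -> RAW)
instruction_status = {            # which of the 4 steps each instruction is in
    "L.D F6,34(R2)":  "Write result",
    "MUL.D F0,F2,F6": "Issue",
}

functional_unit_status = {        # the 9 fields per functional unit
    "Integer": dict(Busy=True, Op="Load", Fi="F6", Fj="R2", Fk=None,
                    Qj=None, Qk=None, Rj=False, Rk=None),
    "Mult1":   dict(Busy=True, Op="Mult", Fi="F0", Fj="F2", Fk="F6",
                    Qj=None, Qk="Integer",   # F6 will come from Integer
                    Rj=True,  Rk=False),     # Fk not yet ready to read
}

register_result_status = {        # which FU will write each register
    "F6": "Integer", "F0": "Mult1",
}

# Read-operands check for MUL.D: both source registers must be ready.
mul = functional_unit_status["Mult1"]
print("MUL.D can read operands:", mul["Rj"] and mul["Rk"])   # False for now
```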

Summary of Concepts

  • Compiler scheduling
  • HW exploiting ILP
    • works when we can’t possibly know dependencies at compile time
    • code for one machine runs well on another
  • Key idea of scoreboard: allow instructions behind stall to proceed (decode => issue instructions and read operands)
    • enables out-of-order execution ==> out-of-order completion
    • ID stage checked both for structural and data dependencies
    • original version didn’t handle forwarding
    • no automatic register renaming (that comes with Tomasulo’s computer)

Key Points: Summary and Supplement (Important)

To summarize and supplement the key points of the scoreboard workflow:

  • Whether an instruction can issue depends, first, on whether a functional unit is free and available (this information is in the functional unit status) and, second, on whether the register the instruction wants to write is about to be written by another instruction (this information is in the register result status); the second check exists to resolve WAW hazards.
  • Whether an instruction can read its operands depends on whether the scoreboard marks a source register as unreadable; if so, the register is about to be overwritten by an earlier instruction, and the current instruction must wait for that earlier instruction to write back. This check exists to resolve RAW hazards.
  • Once an instruction has finished reading its operands, it can always proceed to execute; execution may take multiple cycles. At the end of the first execution cycle it should update the functional unit status to indicate that it no longer needs to read the registers.
  • Whether an instruction can write back depends on whether any instruction still needs to read the register about to be overwritten; concretely, check whether any register whose Rj/Rk field is marked Yes matches the current instruction’s destination register. If so, some instruction still needs the register’s old value, and the write-back must wait until that instruction has read the old value. This check exists to resolve WAR hazards.

Tomasulo Algorithm

Improved Dynamic Instruction Scheduling: Tomasulo Algorithm

Because the scoreboard algorithm can only detect WAR and WAW hazards but cannot eliminate them, it was improved into the Tomasulo algorithm.

The basic ideas of the Tomasulo algorithm:

  • Tomasulo uses register renaming: the registers of the scoreboard are replaced by a large set of virtual register names, i.e., a virtual register set replaces the real FP register file. Since the virtual set contains far more registers than the real register file, it can be used to implement register renaming.
  • The virtual register set consists of three parts:
    • a reservation station attached to each functional unit (FU)
    • load buffers, which hold the data and addresses of the memory locations being accessed
    • store buffers

Recall: Compiler Techniques for Parallelism

  • Loop unrolling => multiple iterations of loop in software:
    • amortizes loop overhead over several iterations
    • gives more opportunity for scheduling around stalls (see the unrolling sketch after this list)
  • Software Pipelining => take one instruction from each of several iterations of the loop
    • software overlapping of loop iterations
  • Very Long Instruction Word machines (VLIW) =>multiple operations coded in single, long instruction
    • requires sophisticated compiler to decide which operations can be done in parallel
    • trace scheduling => find common path and schedule code as if branches didn’t exist (add “fixup code”)
  • ALL OF THE ABOVE require additional registers!
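Loop unrolling is easiest to see in source form. A minimal Python sketch (so the “registers” are only figurative), assuming the trip count is a multiple of the unroll factor:

```python
# Minimal sketch: unroll a loop by 4. In compiled code this amortizes the
# branch/index overhead over four updates and exposes four independent
# adds that a scheduler can interleave around stalls.
import array

x = array.array("d", range(1000))
s = 2.0

for i in range(0, len(x), 4):   # trip count assumed divisible by 4
    x[i] += s
    x[i + 1] += s
    x[i + 2] += s
    x[i + 3] += s
```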

Recall: How are WAR and WAW hazard handled in scoreboard?

  • WAR hazards handled by stalling in WriteBack stage

  • WAW hazards handled by stalling in Issue stage

  • Is either of these a real hazard?

Tomasulo Algorithm vs. Scoreboard

  • Control and buffers distributed with Function Units (FU) vs. centralized in scoreboard:
    • FU buffers are called “reservation stations” (a buffer attached to a specific FU); they hold pending operands
  • Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called Register Renaming (i.e., pointing to a specific RS and waiting for its result)
    • avoids WAR, WAW hazards
    • more reservation stations than registers, so can do optimizations that compilers can’t do
  • Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs.
  • Load and Store treated as FUs with RSs as well
  • Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue.

Organization

  • The FP registers are connected to every functional unit (FU) by a pair of buses, one for each of the two operands, and to the store buffers by a single bus. There is also the FP instruction queue, where instructions wait to be issued.
  • The outputs of the functional units (FUs) and of the load buffers converge on the CDB, which is connected to the inputs of the FP registers. In this algorithm, a store instruction computes its memory address before it executes.
  • The Common Data Bus (CDB): the CDB is connected to the inputs of the FP registers, the reservation stations, the store buffers, and so on; the only inputs with no connection to it are those of the load buffers. It reaches the register file directly (to update the registers) and the reservation stations of the add/multiply/memory units (to deliver the data that instructions waiting in those stations need).
  • Because the reservation stations and buffers all have their own tags, renaming is implemented here. A reservation station keeps the information of an already-issued instruction together with buffered data; once an instruction leaves the reservation station, it starts executing.

RS Components

  • Op: operation to perform in the unit (e.g., add or sub)
  • Vj, Vk: values of the source operands
    • store buffers have a V field, the result to be stored
  • Qj, Qk: the reservation stations producing the source operands (the values to be written)
    • note: there are no ready flags as in the scoreboard; Qj, Qk = 0 ==> ready
    • store buffers only have Qi, for the RS producing the result
  • Busy: indicates reservation station or FU is busy
  • Register result status — indicates which functional unit will write each register, if one exists; blank when no pending instructions that will write that register.
  • The structure of the reservation stations is a bit like a cache: there may be multiple rows of data, each row corresponding to one instruction that has been issued to the station.

    • Each row has a Busy bit indicating whether the row currently holds an instruction;
    • Vj and Vk differ from the scoreboard: the scoreboard records the source register numbers, whereas the reservation station copies whatever data can already be read directly into the station. As one would expect, once the data enters the reservation station, the corresponding register has nothing more to do with this instruction;
    • Qj and Qk carry the same information as in the scoreboard, recording which instruction will produce the data that cannot be read yet;
    • A is the address field of load/store instructions, used to hold the immediate and the computed address.
  • The reservation stations may look very similar to the scoreboard, but the two differ substantially.

    • Take Add as an example: three Add rows in the reservation stations correspond to one and the same adder, whereas on the scoreboard they would represent three adders.
    • The scoreboard’s one-entry-per-path design easily blocks instructions from issuing, whereas the reservation stations reserve a buffer for each path, so instructions can be issued into the buffer and stand by while the adder is busy.
  • Second, a reservation station buffers the data it reads directly, instead of recording only a register number as the scoreboard does. Recording only the number leads to WAR stalls: until an instruction actually executes, it keeps monitoring its source registers, whose values must not change, so later instructions cannot write back and the pipeline stalls. The reservation station instead follows the principle “execute the instruction as soon as the data is ready”: as soon as an instruction sees readable data, it copies it immediately, and from then on whether the source register is written or not is no longer its concern.

  • What the scoreboard and the reservation stations have in common is that both record Qj and Qk: once the needed data has been computed, it is captured from the broadcast via Qj and Qk. This is in fact renaming, i.e., the data source is tagged with a reservation station number instead of a register number.

  • Besides the reservation station data structure, Tomasulo also keeps a register result status. As with the scoreboard, Tomasulo records which instruction will update each register; this information is used when an instruction looks for its source data.

Three Stages

  • Issue stage
    • For an FP operation instruction: if a reservation station is free (the only criterion for deciding whether it can issue), send the instruction to the station; if the instruction’s operands are already in the FP registers, send the operand values to the station as well.
    • For a Load or Store instruction: if a buffer is free, send it to the corresponding buffer.
    • If no reservation station or buffer is free, there is a structural hazard; stall the instruction until the corresponding station or buffer becomes free.
    • Renaming is completed at this stage, because operands in the reservation stations no longer use register numbers.
  • Execute stage
    • If one or more operands are not ready, wait for them while monitoring the CDB. As soon as an operand becomes ready, store it into the corresponding reservation station; when both operands are ready, perform the operation. This stage checks for RAW hazards.
  • Write Back stage
    • Once the result has been computed, write it onto the CDB, and from the CDB into the destination register and into the reservation stations waiting for this result. When the same register is written several times in a row, only the last write goes through, which eliminates WAW hazards.

1. Issue:

  • Get instruction from FP Op Queue
    • if reservation station free (no structural hazard), control issues instructions and sends operands (renames registers);

2. Execution:

  • Operate on operands (EX)
    • when both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write Result:

  • Finish execution (WB)
    • write it on the Common Data Bus to all awaiting units (all the FUs waiting for the broadcast result); mark the reservation station available

Remarks:

  • Normal data bus: data + destination (a “go to” bus, i.e., the destination-oriented data bus of conventional computers)
  • Common Data Bus: data + source (a “come from” bus, i.e., a NEW source-oriented data bus that broadcasts a result FROM a specific FU as the data source)
    • 64 bits of data + 4 bits of Functional Unit source address
    • a unit writes the value if the source matches the Functional Unit it expects (the one producing its result); this performs the broadcast (see the sketch below)
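A minimal sketch of the CDB mechanism, assuming made-up station tags (Add1, Mult1) and operand values. Each waiting reservation station and the register-status table snoop the bus and capture the value when the source tag matches:

```python
# Minimal sketch: reservation stations plus one CDB broadcast.
# Qj/Qk hold the tag of the producing station (renaming); Vj/Vk hold
# operand values already captured.
stations = {
    "Add1":  dict(Busy=True, Op="SUB", Vj=None, Vk=2.5, Qj="Mult1", Qk=None),
    "Mult1": dict(Busy=True, Op="MUL", Vj=3.0,  Vk=4.0, Qj=None,    Qk=None),
}
register_status = {"F0": "Mult1", "F8": "Add1"}   # newest writer per register

def broadcast(tag, value):
    """Write Result: put (source tag, value) on the CDB; everyone snoops."""
    for rs in stations.values():                   # waiting stations capture it
        if rs["Qj"] == tag:
            rs["Qj"], rs["Vj"] = None, value
        if rs["Qk"] == tag:
            rs["Qk"], rs["Vk"] = None, value
    for reg, src in list(register_status.items()): # only the newest writer
        if src == tag:                             # may update the register
            print(f"{reg} <- {value}")
            del register_status[reg]
    stations[tag]["Busy"] = False                  # free the station

broadcast("Mult1", 3.0 * 4.0)                      # Mult1 finishes: 12.0
add1 = stations["Add1"]
print("Add1 ready:", add1["Qj"] is None and add1["Qk"] is None)   # now True
```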

Example

(The slides step a Tomasulo example through cycle 57 and compare it with the scoreboard at cycle 62.)

Key Points of the Tomasulo Scheduling Flow

  • Whether an instruction can issue depends only on whether the reservation station of its path has a free entry; if so, it can be issued into the station to wait for execution. At issue time, any data that can already be read is copied directly into the reservation station, so WAR hazards need not be considered: a later instruction can write back as soon as it completes, without worrying whether some earlier instruction still needs to read the register. In other words, once an instruction has been issued to a reservation station, it never needs to read the register file again.
  • At issue time the instruction also updates the register status table. If a later instruction’s destination register coincides with that of an earlier instruction, the register is tagged with the later instruction’s write information, meaning only the later instruction’s result will be written to the register; this resolves WAW hazards.
  • If an instruction is already executing in the functional unit, other instructions wait in the reservation station. If an instruction is missing source data, it stays in the station and keeps listening to the CDB; as soon as the CDB broadcasts the needed data, it copies the data immediately and gets ready to execute.
  • An instruction can execute once all of its source data is ready; execution may take multiple cycles.
  • As soon as an instruction finishes its computation, it can write back; the data goes over the CDB straight to the register file and to each reservation station. Note, however, that an instruction’s result does not necessarily end up in the register file: the register result status table always holds the newest state, so on a WAW hazard Tomasulo records the newest writing instruction and discards the results of earlier writes, which are never written to the register file. This is very much in the spirit of dataflow.

Strengths

Tomasulo successfully handles all three kinds of hazards, achieves out-of-order execution, and performs better than the scoreboard. There are three specific improvements:

  • The scoreboard can hold only one instruction per path, so instructions frequently cannot issue because of structural hazards; by introducing reservation stations, Tomasulo can buffer several instructions per path, which smooths out the instruction issue rate.
  • On WAW hazards, the scoreboard is overly attached to register names: it writes every instruction’s result into the register file and blocks issue because of the WAW hazard, whereas Tomasulo keeps only the newest value to be written, which both guarantees the correct result and avoids pointless work.
  • On WAR hazards, the scoreboard again fixates on register names: until it executes, an instruction keeps watching the register file and fetches its operands from there once the data is ready; as a consequence, a later instruction may be unable to write back immediately even after it has finished computing. Tomasulo instead copies the data at issue time, following the dataflow idea that “the register name does not matter; what matters is the data in the register.”

For RAW hazards, both use the data bus (the CDB) to achieve logical correctness; both resolve RAW hazards.

Weaknesses

Tomasulo is not perfect, however; it has problems of its own. The slide in Figure 7 lists a series of issues with the Tomasulo algorithm.

In Tomasulo, each execution unit has its own reservation station, and the station buffers several instructions, so several instructions may have their data ready in the same cycle; but the execution unit can execute only one instruction at a time, so one of them must be chosen. A simple solution is to add an age field to every reservation station row; whenever a conflict occurs, the oldest instruction is selected and sent to the execution unit.

In the earlier example the circuit has only one CDB, which means only one instruction can write back per cycle; if several instructions complete at the same time, only one can broadcast and the rest must wait. The second option is to add more CDBs to support multi-instruction broadcast, but that greatly increases circuit area: more register file write ports, more comparators between the reservation station tags and the CDBs, and more reservation station write ports.

Drawbacks

  • Complexity
    • delays of the IBM 360/91, MIPS 10000, IBM 620?
  • Many associative stores (CDB) at high speed
  • Performance limited by the Common Data Bus
    • multiple CDBs ==> more FU logic for parallel associative stores

Why issue in-order?

  • In-order issue permits us to analyze the dataflow of the program
  • This way, we know exactly which results should flow to which subsequent instructions
    • if we issued out-of-order, we would confuse RAW and WAR hazards
    • the most advanced machines ALL issue in-order ;-)
  • This idea works perfectly well “in principle” with multiple instructions issued per clock:
    • need to multi-port the “rename table” and be able to rename a sequence of instructions together
    • need to be able to issue to multiple reservation stations in a single cycle
    • need 2x the number of read ports and 1x the number of write ports in the register file
  • In-order issue can be a serious bottleneck when issuing multiple instructions per clock cycle

Five tips for developing an effective cloud architecture application

  • Ensure that your application is scalable by designing each component to be scalable on its own. If every component implements a service interface, responsible for its own scalability in all appropriate dimensions, the overall system will have a scalable base.
  • For better manageability and high-availability, make sure that your components are loosely coupled. The key is to build components without having tight dependencies between each other, so that if one component were to die (fail), sleep (not respond) or remain busy (slow to respond) for some reason, the other components in the system are built so as to continue to work as if no failure is happening.
  • Implement parallelization for better use of the infrastructure and for performance. Distributing tasks over multiple machines, multithreading your requests, and effectively aggregating results obtained in parallel are some of the techniques that help exploit the infrastructure (see the sketch after this list).
  • After designing the basic functionality, use techniques and approaches that will ensure resilience. If any component fails (and failures happen all the time), the system should automatically alert, fail over, and re-sync back to the “last known state” as if nothing had failed.
  • Don’t forget the cost factor. The key to building a cost-effective application is using on demand resources in your design. It’s wasteful to pay for infrastructure that is sitting idle.
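As a sketch of the parallelization tip above: fan independent tasks out to a pool of workers and aggregate the results. `resize_image` and the file names are hypothetical placeholders for real work:

```python
# Minimal sketch: parallelize independent tasks and aggregate the results.
from concurrent.futures import ThreadPoolExecutor

def resize_image(name: str) -> str:
    # Placeholder for real work (fetch one image, resize it, store it).
    return f"{name}: resized"

images = [f"img_{i:04d}.jpg" for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() distributes the tasks across workers and returns results in
    # submission order, which makes aggregation trivial.
    results = list(pool.map(resize_image, images))

print(len(results), "images processed")
```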

There are numerous examples of cloud architectures including:

  • Processing pipelines
    • Document processing pipelines to convert hundreds of thousands of documents from one format to another, e.g. from MS Word to PDF;
    • Image processing pipelines – say, resizing millions of images;
    • Video transcoding pipelines – transcoding AVI to MPEG movies;
    • Indexing – creating an index of web crawled documents;
    • Data mining – performing search over millions of records.
  • Batch processing systems: back-office applications in finance, insurance or retail sectors, log analysis, nightly builds or automated unit testing and deployment, etc

Appendix

MIPS instructions

  • Arithmetic Instructions
  • Logical
  • Data Transfer
  • Conditional Branch
  • Comparison
  • Unconditional Jump
  • Registers