Intel Pentium 4 Processor
The Intel Pentium 4 processor was introduced in November 2000 at a clock speed of 1.5 GHz. The processor uses the Intel NetBurst micro-architecture, which is characterized by high clock rates, along with other innovations intended to deliver high performance. The Pentium 4 is designed for varied applications in which end users value CPU performance (Boggs, Darrell, et al.). The workloads it targets include 3D applications and games, multimedia, and multitasking user environments (Tuck, Nathan, and Tullsen).
The processor supports real-time encoding of MPEG2 video and near real-time MPEG4 encoding. The Pentium 4 is implemented with 42 million transistors on Intel's 0.18u CMOS process, with six levels of aluminum interconnect.
Many considerations go into designing an effective architecture. The key is to balance features that raise performance against the processor cost and validation effort they add. The Intel Pentium 4 implements the NetBurst micro-architecture, which is illustrated in the block diagram below.
The front end is the part of the processor that fetches the instructions to be executed next in the program and prepares them for use later in the machine pipeline. Its main job is to supply a high-bandwidth stream of decoded instructions to the out-of-order execution core. The front end has highly accurate branch prediction logic that uses the past history of program execution to predict where the program will go next. Instructions are fetched from the level 2 cache based on these predictions (Gepner, Pawel, and Kowalik). The fetched IA-32 instructions are then decoded into micro-operations (uops), the simple operations that the execution core executes.
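The idea of predicting a branch from its past history can be sketched with a textbook 2-bit saturating counter. This is a generic illustration only, not the Pentium 4's actual predictor; the class name, the table size and the modulo indexing are all assumptions made for the example.

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating-counter branch predictor (illustrative only)."""

    def __init__(self, table_size=16):
        self.table_size = table_size  # hypothetical table size
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * table_size

    def _index(self, branch_address):
        # Assumed indexing scheme: low bits of the branch address.
        return branch_address % self.table_size

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, branch_address, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            correct += 1
        predictor.update(branch_address, taken)
    return correct / len(outcomes)


# A loop branch that is taken 9 times and then falls through, run 10 times.
loop_history = ([True] * 9 + [False]) * 10
loop_accuracy = accuracy(TwoBitPredictor(), 0x400, loop_history)
```

Because the counter needs two wrong guesses in a row to change its prediction, the single fall-through at the end of each loop does not disturb the prediction for the next run of the loop.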
The Execution Trace Cache is the advanced level 1 instruction cache of the NetBurst micro-architecture. Unlike other instruction caches, it sits between the instruction decode logic and the execution core, as shown in the block diagram above (Wijeratne, Sapumal, et al).
The Trace Cache stores micro-operations (uops), that is, IA-32 instructions that have already been decoded (Wijeratne, Sapumal, et al). Because the uops are already decoded, the system can reuse them repeatedly without decoding them again. The IA-32 instruction decoder is used only when the Trace Cache does not hold the decoded instructions that are needed.
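The reuse of decoded uops can be sketched as a small cache that calls the decoder only on a miss. This is a simplified model under stated assumptions: it indexes by a single instruction address rather than by traces, and the names `TraceCacheSketch` and `toy_decode`, as well as the uop format, are hypothetical.

```python
class TraceCacheSketch:
    """Simplified model: a cache of already-decoded uops (illustrative only)."""

    def __init__(self, decode):
        self.decode = decode     # hypothetical decoder: instruction -> list of uops
        self.lines = {}          # instruction address -> decoded uops
        self.decoder_calls = 0   # how often the slow decode path was used

    def fetch_uops(self, address, instruction):
        if address not in self.lines:
            # Miss: only now is the instruction decoder needed.
            self.decoder_calls += 1
            self.lines[address] = self.decode(instruction)
        # Hit (or freshly filled line): reuse the stored uops.
        return self.lines[address]


def toy_decode(instruction):
    """Made-up decoding rule: a memory-operand add becomes a load plus an add."""
    op, dst, src = instruction
    if op == "add_mem":
        return [("load", "tmp", src), ("add", dst, "tmp")]
    return [(op, dst, src)]


# A two-instruction loop body executed 100 times.
loop_body = {0x10: ("add_mem", "eax", "[ebx]"), 0x13: ("dec", "ecx", "ecx")}
cache = TraceCacheSketch(toy_decode)
for _ in range(100):
    for address, instruction in loop_body.items():
        cache.fetch_uops(address, instruction)
```

In this sketch the decoder runs once per distinct instruction, however many times the loop executes.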
Out-of-order execution engine
Out-of-order execution logic
This is the section where instructions are prepared for execution. It contains many buffers that smooth and re-order the instruction flow to optimize performance. The logic re-orders instructions so that those whose input operands are available can execute immediately. Instructions that follow a delayed instruction in the program are allowed to execute as long as they do not depend on it. As a result, the ALUs and the caches of the CPU are kept as busy as possible.
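The rule that an instruction may start as soon as its input operands are available can be sketched with a toy scheduler. It is a minimal model under several assumptions that are not in the text: unlimited issue width, each register written only once (as if already renamed), and made-up latencies. The function name `issue_order` is hypothetical.

```python
def issue_order(instructions, initial_ready, max_cycles=1000):
    """Return (cycle, name) pairs in the order in which instructions start.

    instructions: list of (name, dest, sources, latency) in program order.
    initial_ready: dict mapping register -> cycle at which its value is available.
    """
    ready_at = dict(initial_ready)
    waiting = list(instructions)
    schedule = []
    cycle = 0
    while waiting:
        if cycle > max_cycles:
            raise RuntimeError("some operand never becomes available")
        still_waiting = []
        for name, dest, sources, latency in waiting:
            # An instruction may start once every source operand is available.
            if all(ready_at.get(src, float("inf")) <= cycle for src in sources):
                schedule.append((cycle, name))
                ready_at[dest] = cycle + latency
            else:
                still_waiting.append((name, dest, sources, latency))
        waiting = still_waiting
        cycle += 1
    return schedule


program = [
    ("load r1", "r1", ["r0"], 3),  # slow load, for example a cache miss
    ("add r2", "r2", ["r1"], 1),   # depends on the load, so it must wait
    ("mul r4", "r4", ["r3"], 1),   # independent, so it can run ahead of the add
]
order = issue_order(program, {"r0": 0, "r3": 0})
```

Here the multiply starts in cycle 0 alongside the load, while the dependent add waits until the load result is available in cycle 3.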
The out-of-order engine also contains the retirement logic. This section puts instructions that were executed out of order back into their original program order. The Intel Pentium 4 processor retires up to 3 uops per clock cycle. The retirement logic also passes branch information to the branch history in order to update the branch prediction.
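In-order retirement with a limit of 3 uops per clock cycle can be sketched with a queue that holds uops in program order. The width of 3 comes from the text; the function name `retire`, the finish times and the rule that a uop may retire in the cycle it finishes are assumptions for the example.

```python
from collections import deque

RETIRE_WIDTH = 3  # from the text: up to 3 uops retire per clock cycle


def retire(program_order, finish_cycle):
    """Return a dict mapping each uop to the cycle in which it retires.

    program_order: uop names in original program order.
    finish_cycle: dict mapping uop name -> cycle in which its execution completes.
    """
    pending = deque(program_order)  # oldest uop is always at the left
    retired_at = {}
    cycle = 0
    while pending:
        retired_this_cycle = 0
        # Only the oldest uop may retire, and only once it has finished executing.
        while (pending and retired_this_cycle < RETIRE_WIDTH
               and finish_cycle[pending[0]] <= cycle):
            retired_at[pending.popleft()] = cycle
            retired_this_cycle += 1
        cycle += 1
    return retired_at


# Uops b to e finish early, but none can retire before the older uop a.
finish = {"a": 3, "b": 0, "c": 0, "d": 1, "e": 1}
retirement = retire(["a", "b", "c", "d", "e"], finish)
```

In this example a, b and c retire together in cycle 3, and d and e follow in cycle 4 because of the 3-uop limit.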
Integer and floating point execution units
This is where instructions are actually executed. It includes the register files, which store the integer and floating-point operand values that the instructions need. The execution units include several different types of integer and floating-point units, and their function is to compute the results. The L1 data cache is used for most load and store operations.
The NetBurst micro-architecture was designed with a deeper pipeline that has fewer gates of logic per clock cycle. This allows a higher main clock target for Pentium 4 processors than for P6 processors. Different parts of the Pentium 4 chip run at different clock frequencies, each chosen to suit the logic in that part. The fastest clock drives the ALU-bypass execution loop, which is used mainly for integer instructions. Most other parts of the chip run at half the speed of the fastest clock. The bus logic runs at the lowest speed, 100 MHz, to match the needs of the system bus (Tian, Xinmin, et al). The figure below shows the pipelines of the P6 and Pentium 4 processors. Comparing the two, the P6 has a 10-stage misprediction pipeline while the Pentium 4 has a 20-stage misprediction pipeline.
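The relationship between the clock domains, and the cost of a deeper pipeline, can be worked through with a little arithmetic. The sketch assumes that the main clock is the 1.5 GHz launch frequency quoted in the introduction and that a mispredicted branch costs roughly one cycle per misprediction pipeline stage; the function names and the branch counts are hypothetical.

```python
MAIN_CLOCK_MHZ = 1500  # assumed: the 1.5 GHz launch frequency from the introduction
BUS_CLOCK_MHZ = 100    # from the text: the bus logic runs at 100 MHz


def clock_domains(main_clock_mhz=MAIN_CLOCK_MHZ, bus_clock_mhz=BUS_CLOCK_MHZ):
    """Clock frequencies implied by 'most of the chip runs at half the fastest clock'."""
    return {
        "fast_alu_mhz": 2 * main_clock_mhz,
        "main_mhz": main_clock_mhz,
        "bus_mhz": bus_clock_mhz,
        "main_cycles_per_bus_cycle": main_clock_mhz // bus_clock_mhz,
    }


def flush_cycles(mispredicted_branches, pipeline_stages):
    """Rough cost model (an assumption): one lost cycle per stage per misprediction."""
    return mispredicted_branches * pipeline_stages


domains = clock_domains()
p6_cost = flush_cycles(mispredicted_branches=1000, pipeline_stages=10)
pentium4_cost = flush_cycles(mispredicted_branches=1000, pipeline_stages=20)
```

Under this rough model the same number of mispredictions costs the 20-stage pipeline twice as many cycles as the 10-stage one, which is why a deep pipeline depends on accurate branch prediction.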
The micro-architecture supports up to three levels of on-chip cache, of which only two are implemented in the Pentium 4 processor. The first level consists of the caches closest to the execution core: the Execution Trace Cache for instructions and the L1 data cache. The level 2 cache stores instructions and data that do not fit in the Execution Trace Cache and the L1 data cache. For main memory, Pentium 4 systems use RDRAM or DDR SDRAM, and newer designs use DDR2 SDRAM. Because the memory is dual channel, modules must be installed in pairs. Dual-channel DDR supports a maximum of 4 GB of memory.
Virtual memory is a technique for managing physical memory resources. The Intel Pentium 4 uses 32-bit virtual addresses, which gives a 4 GB address space to match the 4 GB maximum of main memory. The register set consists of eight 32-bit general-purpose registers, eight floating-point stack registers and six segment registers.
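The link between a 32-bit address and a 4 GB address space is simple arithmetic, shown here together with the way a virtual address splits into a page number and an offset. The 4 KB page size is an assumption (it is the standard IA-32 page size but is not stated in the text), and the function names are hypothetical.

```python
ADDRESS_BITS = 32
PAGE_SIZE = 4096  # assumed 4 KB pages


def address_space_bytes(bits=ADDRESS_BITS):
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** bits


def split_address(virtual_address, page_size=PAGE_SIZE):
    """Split a virtual address into (virtual page number, offset within the page)."""
    if not 0 <= virtual_address < address_space_bytes():
        raise ValueError("address does not fit in 32 bits")
    return divmod(virtual_address, page_size)


total_gib = address_space_bytes() // (1024 ** 3)   # 4
page_number, offset = split_address(0x12345678)    # (0x12345, 0x678)
```

With 4 KB pages the low 12 bits of the address form the offset and the remaining 20 bits select the page.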
Boggs, Darrell, et al. "The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology." Intel Technology Journal 8.1 (2004).
Gepner, Pawel, and Michal Filip Kowalik. "Multi-core processors: New way to achieve high system performance." Parallel Computing in Electrical Engineering, 2006. PAR ELEC 2006. International Symposium on. IEEE, 2006.
Tian, Xinmin, et al. "Inside the Intel 10.1 Compilers: New Threadizer and New Vectorizer for Intel Core2 Processors." Intel Technology Journal 11.4 (2007).
Tuck, Nathan, and Dean M. Tullsen. "Initial observations of the simultaneous multithreading Pentium 4 processor." Parallel Architectures and Compilation Techniques, 2003. PACT 2003. Proceedings. 12th International Conference on. IEEE, 2003.
Upton, Michael. "The Intel Pentium® 4 Processor." (2000): 1-15.
Wijeratne, Sapumal, et al. "A 9GHz 65nm Intel Pentium 4 processor integer execution core." Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International. IEEE, 2006.