Frame rates vary for a number of reasons. They factor considerably into the visual department because smooth, stable frame rates look better while playing a game. While 30 FPS is perfectly playable, 60 FPS at the same visual quality will just make the game feel much better.
The reason I mention this here is that PC games typically showcase relatively unstable frame rates compared to consoles. Unless your PC is far beyond the recommended requirements of a game, you will probably notice that most games frequently drop to around 10-15fps in certain parts and climb to 30fps or more in others. I’m not completely blaming developers, since they have a lot of different hardware to worry about, but it is something that degrades the overall pleasure of playing a game.
Controllers:
Mouse and keyboard vs Playstation 3 controller. When it comes to RTS and FPS games, the Playstation 3 gets owned along with every other console. Playing these types of games at the highest multiplayer tiers will always yield better players on the mouse + keyboard combo. That being said, the controls can still work on the Playstation 3, and players can get relatively good.
For many other game types, a PC keyboard and mouse suffer much like a console controller does with RTS and FPS. You’d probably want a PC gamepad or joystick to play flight sims, fighting games, racing games, sports games, and probably more. The problem with a PC is that these devices aren’t standard, and not every developer will care to put in rumble or motion sensing features even if they are available on certain PC gamepads on the market. The number of buttons supported by a decently programmed PC game does scale to whatever the user has, though. PCs are lagging behind in the pressure sensitivity department, and I don’t even think DirectX supports detecting pressure on button presses unless they’ve actually updated it since DirectX 8 (fyi, DirectX 9 still used the DirectInput8 API).
OMG Look at Crysis!!!:
Yeah, this game got its own mini-section due to how much it has annoyed me on these forums. It is always being compared to the processing abilities of the next generation consoles as if it is some unattainable goal for consoles.
Guess what is responsible for those graphics? I’ve already said it and you probably already know it if you’ve read and understood everything I wrote so far – top end graphics cards. Can the RSX beat it alone? I might lie to you and say “yeah it can do that” and fail to mention that the RSX would likely be running at 5 frames per second if it did - as would any comparable PC graphics card. But I’d rather try to be a bit more honest than what nVidia would tell you. In order for the Playstation 3 to match or surpass those visuals, extreme optimization would have to be in place to take advantage of the RSX’s particular hardware setup and of the RSX’s fat bus to the Cell processor and XDR RAM. Of course, at some point in the future when GeForce 8950GTX SLIs come out, you could probably run Crysis at ridiculously high 16xAA, 16xAF, FP32 HDR and what have you settings, but those are just polish-related visuals and the base look of the game will remain the same.
Short story is that you won’t be disappointed with the Playstation 3’s visuals. It will be quickly outdone by PC graphics cards in terms of the nitty gritty technical settings like AA, HDR, AF, and shader model version whatever. Don’t let that discourage you because artists and improved optimization techniques on the Cell + RSX will make the Playstation 3 visuals pick up where PC developers wouldn’t be spending time to write for specific graphics cards or architectures.
The Final Verdict?
While PC GPUs are evolving and pushing visuals beyond consoles thanks to new graphics card hardware being released yearly, the rest of the PC world is relatively static and offers much slower improvement when it comes to gaming. When multi-core CPUs hit the shelves for desktop PCs, there could be an increase in performance for games and more tasks could be offloaded to the CPU, but no more than what the Xbox360 has or will show us with its 3 cores.
All of the next generation consoles already possess more game processing power than PCs thanks to the improved SIMD units on the CPU side of things. Unfortunately, developers aren’t taking the best advantage of this extra power in most cases, as writing computational code for games is more difficult than the logical approach of checking for specific cases for results of game interaction. Playstation 3 aims to change this pattern drastically and provide a method of simulating events that occur in games, with less scripting of physics, animation, and other interaction.
PC gaming vs PS3 gaming doesn’t really have a clear technical winner. PCs constantly evolve, so they will always win on graphics over time if you always have the top end graphics card. The Playstation 3 will offer more flexible computational power that can be applied to more accurate physics, sound, or other computational tasks than a PC. The PS3 cannot catch up graphically, which seems to be the most important or obvious difference between games. But PCs will not catch up in physics processing and other computational simulations unless the physics card catches on and is integrated well.
*** POST UPDATE ***
"Spokesperson/Developer said the PS3 can/can’t do this!!!"
There have been so many references to people saying things about the next generation consoles that it’s worth having a section for them. There are a number of reasons why a given person said what they said. They are either not well informed, speaking in a very limited context that is unlikely to apply, or speaking in a context which doesn’t hold much validity and is based on analysis taken out of context. Here are a few of the popular ones and analysis on why they are either wrong, right, or both depending on the context:
John Carmack:
Yeah, you knew I was going to bring him up. John Carmack has generally been the mastermind behind the stunning visuals of the PC series Doom and Quake since the start of both. He has invented a few computer graphics techniques to bring these visuals to the table. It is also very important to note that his games generally push visual limits and not much else, especially if you look at the gameplay quality of his latest games compared to their visual quality.
On G4TV:
He basically stated on G4TV that Sony made the “less optimal decision” with the peak performance of the PS3 related to the Cell processor. First thing to note is that he never said that the PS3 is weaker. He actually generally agrees that more can come out of the Playstation 3.
One of the worst things he says in this interview, which should wave a huge red flag in front of your face, is that you put most of the work in the 2 threads on the PPE. If you really know the Cell processor, then what he outlined is just one approach to solving problems, and may not be the most efficient or optimal solution.
The other thing he seemed to have issues with was the SPEs having to run separate programs that need to be in small nuggets. For him and many developers, it may be an arduous process that they are not used to, but on the Cell, it should be the natural solution. Nuggets of code aren’t even uncommon when one looks at library modules that are loaded at run time with programs on the PC (.dll files in Windows, .so files in Linux). SPEs are a nice fit to load external libraries (middleware), and even within a single project, modular code libraries can be well separated on individual or groups of SPEs. The general point is that nuggets aren’t all bad. If the solution to the problem doesn’t fit the paradigm well, then dividing code into equally sized, small nuggets may be hard to do. Otherwise, it shouldn’t have a huge impact on development time and could actually speed it up.
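The run-time "nugget" idea can be sketched in miniature. This Python snippet stands in for a .dll/.so (or an SPE code module) being pulled in by name only when it is needed; the `load_module` helper and the choice of the `math` module are just illustrative, not anything Cell-specific.

```python
import importlib

# Load a "nugget" of code by name at run time, the way a PC game loads a
# .dll/.so plugin, or a Cell program streams a code module out to an SPE.
def load_module(name):
    return importlib.import_module(name)

math_nugget = load_module("math")   # fetched only when asked for
print(math_nugget.sqrt(16.0))       # use the loaded nugget's functions
```

The point is the same as with SPE modules: the main program stays small, and self-contained pieces of functionality are brought in on demand.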
The last major thing he said is that Sony is forcing developers to sweat blood to take advantage of the Cell. This is kind of far from the truth. Developers do not have to take absolute advantage of what the Cell offers if it isn’t necessary for what they intend on delivering with their game.* On a basic level, one 3.2GHz general purpose core along with the RSX graphics chip should serve games well coming straight from a PC with a comparable graphics card and a relatively average demand on CPU resources. This does “waste” 7 SPEs, but if a game really doesn’t need them, or can’t find an easy to implement feature or function to put on them, then they should be left alone. Any game that aims to show something technically superior from the Cell CPU should be looking at the Cell in depth and know its ins and outs pretty well. That forced labor is really just a piece of hardware that Sony is giving to them and a feeling of obligation to use it. It’s only forced if the SPEs are necessary to get the performance developers want. That is a big “if,” but everyone has to weigh whether the solution’s benefits are worth paying the price to achieve them. John Carmack probably doesn’t see it as being worth it.
The “less optimal decision” that Carmack referred to early in the interview is his assessment of what he believes developers will actually be able to do on the Cell – not what they can do in terms of absolutes. He is primarily basing this on the PC game development world he is used to. There is truth to his statements, as I don’t foresee any PC game developer being able to develop for the PS3 and get any kind of superior usage of its processing power. I do believe there are console developers that will make the machine shine.
*It’s actually somewhat hypocritical for John Carmack to be angry or displeased with Sony at all for setting such a high bar with complex and powerful hardware. John Carmack himself has received some displeasure from other developers, since when he sets a bar at a certain level, gamers suddenly expect all games to look on the level of his or better. The situation is similar in that they aren’t actually forced to do anything, but they feel obligated to if they want to rise up to a certain level.
At QuakeCon:
QuakeCon’s keynote speech is generally given by John Carmack. It is a speech where he gives his general view on where the industry has been going, and where it will likely go in the future. He looks at the gaming industry from many angles and mentions problems that occur in many projects as they are either unique to the industry or exist in the developer world in general.
The next generation consoles got some mention in his keynote speech in 2005 and a few more comments in 2006. He actually does recognize the Playstation 3 to be the more powerful hardware, but not by a considerable enough margin to outweigh the difficulty of programming for it and the effort needed.
In much of his QuakeCon 2006 keynote speech, he talked about the gains of certain algorithms and techniques in games. He is critical of some of them and sees that a lot of them might require a lot of effort for minimal gains. For the most part he is moving away from doing things like that as the cost of development goes up. This probably also factors heavily into his preference for the Xbox360 platform, rather than an overall power issue assuming infinite time and money. That isn’t to say development for the Playstation 3 is ridiculously harder either.
Anyways, I think it’s best that any fanboy without a solid understanding of John Carmack and the game development world shouldn’t take what John Carmack says at face value. He always speaks from a perspective – just like this article is doing right now.
Microsoft:
There are a number of Microsoft executives who have made statements about the Playstation 3’s hardware. Here are the major relevant ones:
General Overall Comparison Report Handed to IGN:
IGN: E3 2005: Microsoft's Xbox 360 vs. Sony's PlayStation 3
This is the first result if you type in “Xbox360 vs PS3” or the other way around in Google – and it’s extremely scary in its inaccuracies. The article is basically a direct replica of information IGN received from Microsoft that compares the two consoles. If you read the article, pay close attention to how much of the Cell’s hardware they ignore (the SPEs). Pay attention to how they push general purpose processing power yet stress their VMX-128 processing (which is not general purpose). Look at how they add up bandwidth numbers that make no sense if you’re looking at a system diagram of the Xbox360. They also speculate hugely on what the RSX is actually capable of, when even more than a year later, it is still unknown. This article is important because it is actually the root of a lot of later press statements Microsoft has made about the PS3, and it outlines their basic strategy/premise against the Playstation 3 from a technical perspective.
The grand analytical conclusion of this article was that in some areas the PS3 outdoes the Xbox360, but in other areas the Xbox360 outdoes the Playstation 3 – which is true. But their expert opinion is that all of the areas where the Xbox360 outdoes the Playstation 3 are the more important ones for gaming, which makes the Xbox360 have “provably more performance than PS3.” I think those PR guys at Microsoft failed discrete math and all prerequisite courses leading to it, because they really suck at proofs. I wish IGN had a technical staff that knew some of what they were talking about; then they probably wouldn’t have published it. If you hate me for writing this, you can read that article and believe every word written there freely to make yourself happy.
Matt Lee:
Interview with Microsoft's Matt Lee, Xbox 360 developer top gun: Page 1
In this interview, Matt Lee attempts to present a more technical look into the PS3 compared to the Xbox360. He is definitely on the development side of things, so I assume he holds complete enough knowledge of computer hardware and what hardware means to software. He’s written DopeWars, worked on an MMO for PC called Mythica, and straight from there moved to the Game Technology Group in Microsoft, where he now advises other developers on how to write efficient code for Xbox360.
Matt was asked at some point during the interview to explain the Xbox360 architecture. I have already familiarized you with the Xbox360 architecture, but you should compare it to his. In this section he made note of the AltiVec (VMX-128) instruction set because he was asked to explain it. Matt answered and mentioned some of the additions to the VMX-128 instruction set, some of which are specific to Direct3D. He also said that the best way to multithread a game has not been decided yet.
When asked whether the Xbox360 hardware had anything to help accelerate physics, Matt pointed out the VMX-128 instruction set first, then fell back to the symmetrical cores, 6 hardware threads to spread out the code, and the unified memory architecture, and even went further to say the GPU could be used to accelerate physics because it is a math monster and architected reasonably well to handle general purpose calculations.
After saying this about his hardware, he had more to say about the PS3 when asked about it:
When asked about the Cell architecture, he specifically says the Cell isn’t designed for gaming as much as Sony would have us believe, and immediately focuses on the SPEs. He attacks them for not having branch prediction – which is true, but in the stream/SIMD/vector processing paradigm, branches are rare. Whenever a branch can’t be avoided, the SPEs have minimal hardware to allow software hints to avoid the penalties of taking it. It seems to me that IBM/Sony/Toshiba went out on a limb to make sure the SPEs did what they needed while avoiding things that cause performance hits.
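To see why branches are rare in this style of code, here is a toy branch-free "select" in Python – the kind of arithmetic/mask trick that SIMD and stream code uses in place of an if/else, so the processor never has to predict anything. The function name and integer-only restriction are my own illustrative choices.

```python
# Branch-free select: pick a or b based on a condition, using masks
# instead of an if/else, the way SIMD/stream code avoids branches.
def select(cond, a, b):
    mask = -int(bool(cond))          # all-ones if true, zero if false
    return (a & mask) | (b & ~mask)  # data flows through arithmetic only

print(select(True, 10, 20))   # picks a
print(select(False, 10, 20))  # picks b
```

Real SPE (and SSE/VMX) code does exactly this across whole vectors at once, which is why the lack of branch prediction hurts far less than it would on general purpose code.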
He also says that the SPEs are poorly suited to run most game code – wait a second, define “most game code” for us, Matt. In source code, general-purpose logic takes up the most volume and even takes up the most space in memory. But in execution time, most game code isn’t general purpose and branch heavy.
Additionally, the 8 operational cores of the Cell, with 2 threads on one core, provide more thread space for parallelizing games. But I guess he forgot to give the Playstation 3 the same consideration after mentioning the space to spread code across 3 cores.
He then says(in typical MS fashion) a line similar to “it can only do this” and points out that the SPEs are only good for serialized streaming math code that digital signal processors typically do. He may be right in what it is good for, but he is wrong if he thinks it is the only thing they are good for. Use your own judgment on what the SPEs are good for by understanding what they do, and understanding the things developers have processors do to solve them.
His next attack goes at the memory architecture (local store) of the SPEs: he says the lack of automatic cache coherency (traditional cache behavior) seems as if it would cause a lot of overhead to work with, having to copy results to system memory through DMA transactions. The problem with this statement is that he is restricting the operational nature of the SPEs to writing results of computations to system memory. This is far from the truth, as it isn’t by any means a necessary action SPEs have to take, and it is less than optimal in many situations, as all 7 SPEs and the PPE would be sharing 25.6GB/s of bandwidth. An approach that works far better is making the most of the element-to-element communication bandwidth on the EIB, and only accessing system memory when necessary. SPEs are also likely to output data to other input/output devices such as the graphics card, sound hardware, or other elements in a typical game scenario. Writing out to system memory for communication and processing game data is merely the easiest approach in developers’ eyes, as it changes the information passing approach to information sharing, which is automatic. The fact of the matter is that the SPE local storage has the speed of a traditional cache, but requires manual control. This makes it harder to use, but allows the execution speed to be deterministic and constant, and could possibly even exceed the efficiency of a traditional cache if used that way purely. If there is no need for this level of control, developers can fall back to letting compiler or middleware tools handle the SPE local storage for them.
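The standard way SPE code hides those DMA transfers is double buffering: fetch the next chunk of data into one buffer while computing on the other. Here is a sketch of the pattern in plain Python – `dma_get` is a stand-in (an ordinary slice copy), not a real Cell API, and `process` is a placeholder for real work.

```python
# Double-buffering sketch: overlap the "DMA" fetch of the next chunk
# with processing of the current one, as SPE code does with its
# local store. dma_get is a pretend transfer, not a real Cell call.
def dma_get(src, offset, size):
    return src[offset:offset + size]       # pretend asynchronous fetch

def process(chunk):
    return sum(chunk)                      # stand-in for real work

def run(data, chunk_size):
    total = 0
    buf = dma_get(data, 0, chunk_size)     # prime the first buffer
    for off in range(chunk_size, len(data) + 1, chunk_size):
        nxt = dma_get(data, off, chunk_size)  # fetch next while...
        total += process(buf)                 # ...working on current
        buf = nxt
    return total

print(run(list(range(8)), 4))
```

With real asynchronous DMA, the fetch and the compute genuinely overlap, so the manual local store costs little once the pattern is in place.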
Matt then moves focus to the PPE and says that the Cell lacks the VMX-128 enhancements. That is true, but why does he isolate that part of the Cell and ignore over 40% of the Cell’s die space? Considering the SPE instruction set is dedicated to vector processing, and is considered a superset of the functionality provided in VMX, the SPEs could be called a VMX enhancement too. It just breaks compatibility and goes down its own path for accomplishing the goals. The difference is like apples and oranges - except there are 7 SPEs, and 3 VMX-128 enhanced units.
He also quickly mentions that the single PPE in the Cell has half the cache size, but fails to mention that the Xbox360 is splitting this cache across 3 cores that do not have explicit control over cache coherency. If the Xbox360 cores were actually running independent threads working on independent data, the behavior of the cache would be very unstable, and each core would need to take a piece of the pie – dividing the cache size by 3. The Cell’s PPE is the only consumer of its L2 cache unless access is granted through the PPE’s execution. Additionally, each SPE has its own cache-speed memory. I’m failing to see how on-chip memory is actually a limitation for the Cell but not for the Xbox360. I guess the Xbox360 advantage needs to be qualified as 512KB of extra automatic cache. What that means next to the extra 1280KB of total on-chip memory in the Cell is up to the developer’s ability to put it to good use.
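To put numbers on that, here is the on-chip memory arithmetic spelled out, using the figures above plus the commonly cited 256KB of local store per SPE:

```python
# Back-of-envelope on-chip memory totals, in KB.
xenon_l2 = 1024                   # Xbox360 L2, shared by 3 cores
cell_ppe_l2 = 512                 # Cell PPE's L2 (the "half" Matt cites)
cell_local_stores = 7 * 256       # 7 usable SPEs x 256KB local store each
cell_total = cell_ppe_l2 + cell_local_stores

print(cell_total)                 # total on-chip memory in the Cell
print(cell_total - xenon_l2)      # the Cell's extra on-chip KB
```

So the "half the cache" claim only holds if you pretend the local stores don't exist.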
He also says that all of the “work” has to be crammed onto the PPE in addition to the base PS3 functionality that will be available everywhere. The only “work” that has to be crammed onto the PPE is the work developers feel is better suited to run there rather than on the SPEs. Rendering commands by far don’t have to come from the PPE, as any core inside the Cell has equal access to other elements inside the Cell and out. Sending rendering commands is equivalent to any communication with the FlexIO bus. Additionally, he doesn’t know where the system-level functionality of the console is placed, and the last official indication is that an SPE is used for that purpose.
Matt moves on and states that porting will be difficult (which is true – conversion from SIMD to general purpose and back, and reconsidering how to use the bandwidth efficiently on both platforms). He says this in a manner which strongly implies that general purpose processing is what is needed and is more easily relocated inside the Xenon. I find it rather difficult to even concoct a reasonable scenario where relocating code between cores (keeping the same thread pairings) provides any significant performance difference. Granted, code in the Cell may not be easily movable between an SPE and the PPE, but does it actually matter?
One of the ugliest pieces of information Matt shares is related to the RSX. He was very direct in mentioning that the audience that actually cared about the 512MB of shared memory was developers, and it is important to note that this is the only audience that would care, since it is a matter of developer ease, not performance gains. What he said that was completely wrong was the “you'll never see a PS3 title with more than 256MB of textures at any given time, due to the split graphics and system memory banks” comment. Perhaps he was thinking of the PC world, where the bandwidth between system RAM and the CPU, and between them and video RAM, is in the single digit GB/s order of magnitude; there, textures in system memory will make a game drag if they are accessed frame by frame. Unlike a PC, the Cell and RSX communicate over a 35GB/s bus, and the Cell has 25.6GB/s of bandwidth to its XDR RAM. This translates to 25.6GB/s of bandwidth between the RSX and system memory. More importantly, this extra bandwidth comes on a separate bus from video memory, meaning that developers might actually want to split textures intentionally to increase total bandwidth to the RSX. Rest assured, Matt and Microsoft’s insight is that developers will never want split memory banks because it’s just that much easier to share bandwidth and not have to consider the difference. I firmly believe in the cause of developer ease, but I also believe that when making a console to appeal to gamers through performance, hardware comes first.
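The "split banks as a feature" argument is just addition. A quick sketch, where the 22.4GB/s GDDR3 figure is my own assumption (a commonly quoted number for the RSX's local video memory, not something stated above):

```python
# Bandwidth available to the RSX if textures are deliberately split
# across both memory pools (GB/s). gddr3_local is an assumed figure.
gddr3_local = 22.4       # RSX to its own 256MB of video RAM (assumed)
xdr_via_cell = 25.6      # RSX -> FlexIO -> Cell -> XDR path, per the text

print(gddr3_local + xdr_via_cell)  # combined texture bandwidth, GB/s
```

Since the two paths are separate buses, fetching from both at once adds bandwidth rather than sharing it, which is the opposite of the "split memory is only a burden" framing.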
He finishes up his technical breakdown with his overall belief on the performance difference and calls it a “wash” due to theoretical peak performance numbers that Microsoft ran in the past. I think he is referring to that information in the previously mentioned IGN article which is rather skewed. When you compare theoretical peak processing performances, the Xbox360 is actually twice outdone in floating point operations and many other mathematical operations. I think he means to say that practical performance might be a wash if he anticipates developers will not take advantage of the Playstation 3’s power.
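Here is my own back-of-envelope version of that peak comparison, assuming each vector unit can issue a 4-wide single-precision fused multiply-add (2 flops per element) every cycle at 3.2GHz. These are rough figures of mine, not official numbers from either company:

```python
# Rough single-precision peak GFLOPS, assuming a 4-wide FMA
# (2 flops/element/cycle) per vector unit at 3.2GHz.
GHZ = 3.2
per_vec_unit = 4 * 2 * GHZ           # 25.6 GFLOPS per vector unit

cell_spe_peak = 7 * per_vec_unit     # 7 usable SPEs in the PS3
cell_ppe_peak = per_vec_unit         # the PPE's own VMX unit
xenon_peak = 3 * per_vec_unit        # 3 Xenon cores x VMX-128

print(cell_spe_peak + cell_ppe_peak) # Cell total
print(xenon_peak)                    # Xenon total
```

Even ignoring the scalar units entirely, the Cell's theoretical peak comes out well over double the Xenon's, which is why calling the raw numbers a "wash" doesn't hold up.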
Of course, Matt does make sure to state that Microsoft’s development tools are years ahead of the competition. This scale is in terms of ease of use, as power in a development tool is hard to quantify and isn’t ultimately responsible for the quality of the code that comes out. Technically C# is 20 years ahead of C++ and 30 years ahead of C, but that doesn’t prevent C/C++ from doing all of the same things, and possibly even more than their successors.
Random Developers:
Magnus Högdahl on IGN:
"The PS3 will have a content size advantage with Blu-ray and a CPU advantage for titles that are able to utilize a lot of the SPUs. The Xbox360 has a slight GPU advantage and its general purpose triple-core CPU is relatively easy to utilize compared to SPUs. I expect that it will be near impossible to tell Xbox360 and PS3 screenshots apart."
He’s a designer, not a programmer, for one. And he’s working on a multiplatform title, which pretty much already means the game is not considering how it will push either platform to the max, unless making it better for either console is as easy as flipping a switch. The difference he mentions indicates he has barely even scratched the surface of understanding either console.
Software for PS3:
What is all of the hardware inside the Playstation 3 worth if no one writes software to use it? To the gaming consumer - absolutely nothing. Any of the standard key features of the Playstation 3, including the hard drive, motion sensing controller, blu-ray, network, and more, you can expect developers to use for games where it is clearly beneficial. Otherwise, Sony will be using them in the base functionality of the console, which may extend to game features (keyboard + mouse abstraction). However, in the case of USB, the range of devices that can be attached is broad. Across the network, the devices the Playstation 3 can communicate with are endless. Don’t expect games to use non-standard controllers or control features through USB, except for something that may be well established like the Eye Toy. Don’t expect the Playstation 3 to talk to your laptop running a Windows (Samba) share unless Sony writes software to do it. Some features may still go unused in games, depending on how developers see fit. Whether or not features are taken advantage of depends on the ease of implementation, the standardization of the hardware involved, and how much of a difference it will actually make in the end to consumers.
However, with all of this considered, “homebrew” is an entirely open field. Anything people might want to do with the console will likely be done by a group of developers that also want to accomplish the same task on the Playstation 3. The word homebrew is almost too modest for the Playstation 3, considering the Linux OS is meant to be an open programming playground for Playstation 3 users and developers. You might not get everything you wanted that Sony hasn’t done for you, but the community will go the extra mile to do useful things Sony may have missed or has no intention of doing due to legal issues.
Conclusion:
You don’t have to know that all this crap is in the Playstation 3, or what it means, before you choose to buy it, or buy another console, or buy none at all. Gaming hardware is meant to play games. Being more powerful doesn’t directly result in games that are more fun. Developers have to harness the power, and that seems to be a challenge with the Playstation 3. Rest assured though, the games industry is growing, and budgets are rising to the level of putting in the extra work to reap larger benefits. Some companies may not be up to par in skill for developing games for the Playstation 3, or are unwilling to undergo the risk when they do not know how many consumers will be exposed to their game. The Playstation 3 is powerful hardware – the most powerful to date, with broad functional capabilities too.
Background Topics / Index:
These are some of the common concepts which I may or may not have explained in the text. They are only roughly grouped in logical order, unless I decide to sort them better or differently.
SIMD – Single Instruction Multiple Data. This allows a single instruction fetch to operate on multiple pieces of data. It’s kind of like accommodating multiple people on the same ferry ride instead of taking them individually on smaller boats.
Other similar/related types of processing are:
SISD – single instruction single data
MISD – multiple instruction single data
MIMD – multiple instruction multiple data
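To make the SIMD idea concrete, here's a toy sketch in Python: one conceptual "instruction" (the add) applied across a whole vector of data. Real SIMD hardware does all the lanes in a single clock; the list comprehension here only illustrates the shape of the operation.

```python
# SIMD in miniature: one operation (the add) applied across all the
# "lanes" of two vectors, instead of issuing one scalar add per element.
def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]  # conceptually one op, many lanes

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))
```

A scalar (SISD) processor would need four separate add instructions for the same result; that fetch-once, operate-many saving is the whole point of the ferry analogy above.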
Vector processor/processing – Type of processing that involves arrays(vectors) of data needing the same operation applied to each element. Vector processors are most definitely SIMD processors.
Another related topic to vector processing is stream processing. This topic is very similar to the principles that apply to SIMD architectures.
MMX/SSE/3DNow! technologies were all introduced to add better SIMD processing ability to general purpose CPUs. MMX didn’t do the best job of this, as it shared the floating point registers on Intel’s Pentium chips and thus made SIMD processing and scalar floating point operations unable to occur at the same time.
Scalar processor – Basically a SISD processor. They will execute instructions one at a time, with single pieces of data.
Superscalar processor – A processor that can issue and execute more than one instruction per clock cycle by using multiple execution units. This is not the same thing as SIMD: a superscalar core runs multiple independent instructions at once, while SIMD applies a single instruction to multiple pieces of data. Most modern CPUs are both superscalar and equipped with SIMD units.
DSP – Digital signal processing. Generally used in the process of taking an analog signal (sound/video), converting it to a digital form, processing it by applying some filter or transformation, outputting the results (in digital format to some internal part), and finally converting it to an analog signal again. The chain doesn’t have to be implemented like this and only one part of it actually represents the processing element.
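The "apply some filter" step in that chain can be shown with the simplest DSP filter there is, a moving average smoothing a digitized signal. The function name and the 3-tap width are just illustrative choices:

```python
# A minimal DSP-style filter: a 3-tap moving average, the kind of
# streaming math (load window, combine, emit) that DSPs - and SPEs -
# are built to chew through.
def moving_average(signal, taps=3):
    out = []
    for i in range(len(signal) - taps + 1):
        out.append(sum(signal[i:i + taps]) / taps)  # one output per window
    return out

print(moving_average([0.0, 3.0, 6.0, 3.0, 0.0]))
```

Note the shape of the work: a steady stream of identical arithmetic over sliding windows of data, with essentially no branching, which is exactly the workload the SPE discussion above keeps coming back to.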
GPU – Graphics processing unit. Generally performs the rendering of 3D worlds and images in a 3D game. At the very basic level, you pass it geometry that defines surfaces, textures to apply to those surfaces, and lighting parameters, and it goes through numerous matrix multiplies and algorithms to generate an appropriate view with depth cue effects. This is a vastly simplified explanation of the 3D graphics pipeline, and graphics cards generally do far more for games today. The obvious additions to the 3D pipeline are vertex shaders, which apply per-vertex operations on vertices, and pixel shaders, which perform per-pixel operations on pixels after they have been rasterized.
GPU hardware vastly outperforms traditional CPUs at this task, but the task could be handed back to the CPU, reducing the graphics card to merely moving memory (the frame buffer) to the analog or digital outputs of the screen it is displayed on. This would make for an unbelievably slow game though. When graphics cards first came out, they were not very programmable. They had configurable pipelines depending on application needs, and you generally passed geometry and they simply rendered the scene according to those parameters. Vertex and pixel shaders are bringing control back to programmers, rather than mere configuration.
CPU – Central processing unit. This is the “software” programmable aspect of computers. Everything the CPU does is explicitly spelled out on some level of software. A game executes off of the CPU but will likely assign tasks to other processing units such as the GPU.
ALU – Arithmetic and Logic Unit. These units perform arithmetic(add, subtract, multiply, divide, etc) and logic(and, or, not, nand, xor, etc) operations on registers. CPUs and GPUs both have many ALUs that are used in many parts of the pipeline.
Latency – Access time. It is basically the time it takes for a message to start to arrive at its destination. You can think of it like the speed of sound: latency would be the time it takes for the sound to travel to your listener and start to be heard, not the time it takes for you to complete your message. High latency is worse than low latency. One determinant of latency is the bus the data is traveling on – i.e. light travels slowly in diamond but faster in water. Another determinant is the speed at which the request can be fulfilled – i.e. RAM has to actually find the memory bank to get the data from or write the data to. This is called CAS latency and is a far more in-depth concept that I just found out about while doing more research for this post. You can think of CAS latency as the time it takes a phone operator to find the listing you requested.
Bandwidth – Speed of transmission. It is basically the amount of data that can be sent and received continuously between two communicating participants. You can picture it as the rate at which a speaker can talk and a listener can comprehend the speech. The limiter is the slower of the two participants: if the listener can only comprehend 100 words per minute, it doesn’t matter that the speaker can talk at 200 words per minute, because half the words would be dropped. In computers, the sender is capped to the receiver’s rate so that data which needs to be received isn’t dropped.
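To see how latency and bandwidth combine, here is a toy model (the function name and the numbers are made up for illustration, not tied to any real hardware): the total time to move a block of data is the latency to start the transfer plus the size divided by the bandwidth.

```c
#include <assert.h>

/* Toy model: total transfer time = latency + size / bandwidth.
   Units: latency in microseconds, bandwidth in bytes per microsecond.
   The figures are illustrative, not real hardware numbers. */
double transfer_time_us(double latency_us, double bytes, double bytes_per_us)
{
    return latency_us + bytes / bytes_per_us;
}
```

Notice that for small transfers the latency term dominates the total, which is exactly why caches matter so much.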
Deterministic – The property of being able to be predicted or determined.
Cache(ing) – a memory bank set up to improve access time(latency) from a slower part of memory. It is faster due to its architecture, and also likely due to its physical proximity to the unit that needs access to it.
CPU caches are used to avoid high-latency accesses to RAM, which usually cost hundreds of cycles, as opposed to anywhere from a few cycles to around 20 for cache. Traditionally, CPU caches are hardware controlled, which means they are “automatic” from a software perspective.
You also hear the term “cache” in other contexts, such as hard drive caching. Caching optical media to the hard drive is primarily done to work around the bandwidth limitation of the optical drive rather than a latency issue, since latency to both devices is extremely high anyway. Caching also occurs in web browsers, which store content offline and check whether it has been updated since the resource was last requested.
Pipelining – Almost all processors made in the past decade or more have one or more pipelines. A pipeline is essentially an assembly line for processing instructions, where a single instruction has to go through various stages to complete. It might not seem obvious that a simple instruction like addition would have stages, but it does.
Each stage takes one clock cycle: an instruction spends one cycle in a stage and then moves on to the next. A 7-stage pipeline means that an instruction takes 7 cycles to complete, from the moment it enters the pipeline to the moment it exits. In order to not waste this natural design, pipelining allows multiple instructions to be in flight simultaneously. Thus an addition between two operands can be at stage 3, a subtraction with its two operands at stage 2, and a load instruction at stage 1. Doing this also means each stage of the pipeline needs its own control logic to know what it must accomplish for its piece of the puzzle.
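A back-of-the-envelope sketch of why overlapping instructions matters. Under the simplifying assumption that every stage takes one cycle and nothing stalls, N instructions on a k-stage pipeline finish in k + (N - 1) cycles instead of N * k:

```c
#include <assert.h>

/* Idealized cycle counts for n instructions on a k-stage pipeline,
   assuming one cycle per stage and no stalls. */
unsigned cycles_unpipelined(unsigned n, unsigned k)
{
    return n * k;           /* each instruction runs start to finish alone */
}

unsigned cycles_pipelined(unsigned n, unsigned k)
{
    if (n == 0)
        return 0;
    return k + (n - 1);     /* fill the pipe once, then retire one per cycle */
}
```

For 100 instructions on a 7-stage pipeline that is 106 cycles instead of 700, so the pipelined version approaches one instruction per cycle as the run gets longer.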
Deep pipeline – a deep pipeline is one with many stages. The Pentium 4 has around 20 stages in its pipeline. A deep pipeline presents issues which need to be addressed by other techniques.
Hyper-threading – Presenting one physical core as two logical processors by duplicating its architectural state while the execution pipelines are shared. The primary benefit is being able to process two independent threads simultaneously: if one thread stalls on a high-latency memory access, which could take hundreds of cycles, the other thread can keep running if it doesn’t depend on the stalled one.
ILP – Instruction-level parallelism. Generally this is a topic for superscalar processors. It is a property of instructions that are independent of each other and can therefore be executed concurrently. If you consider this property a little deeper, you can see that it also means those instructions can be executed in any order.
Out-of-order execution – A CPU feature that allows instructions to be executed out of the order in which the code issued them, as the hardware sees fit. The reason for this is, again, to avoid pipeline stalls: if an instruction will take a latency hit at all (a cache or RAM access), the hardware may be able to execute later, independent instructions before the slower task completes.
Out-of-order execution uses hardware to analyze ILP on the fly. It also keeps an instruction window (a re-order buffer) at the end of the pipeline to commit results back in program order when necessary.
In-order execution – the opposite of out-of-order execution. CPUs that must issue instructions in program order because they lack the hardware to execute them out of order and reorder them afterward. This is generally considered inferior for single-threaded performance, but the reduced hardware saves die space and power dissipation, allowing for higher clock speeds and greater efficiency.
Word size – a WORD in computing is the natural unit size that a particular architecture handles. On a 32-bit processor, the WORD size is typically 32 bits, and memory operations are fastest when aligned on 32-bit boundaries. Instructions are 32 bits in size so that fetching one takes only a single cycle.
Word name variations come in DWORD (double word) and QWORD (quad word) – hopefully you can figure out their relative sizes from their names. Be aware that in Windows programming, WORD is fixed at 16 bits for historical reasons (with DWORD at 32 and QWORD at 64), regardless of the underlying architecture’s natural word size.
VLIW – Very long instruction word. Binary opcodes are what tell hardware what to do. Along with those opcodes are labels for which registers to perform the operations on, or literals (data actually embedded in the instruction) to operate on. Together, the opcode and parameter information make up an instruction. The size of an opcode is determined by the number of possible instructions the hardware can perform, and the size of the parameters depends on the number of addressable registers, or on the size of the data that might be encoded into the instruction itself.
Typically, a 32-bit processor uses 32-bit instructions. This means that one instruction will be processed at a time as the whole 32 bits comes in at once and handles execution in a linear fashion. On a 64-bit or higher processor, a bigger instruction(64-bits wide) can be fetched in one cycle, but if the instruction set can fit in only 32-bits, this extra space can be used to fetch and decode two instructions at once to be executed in parallel down two separate pipelines.
VLIW is an approach based on ILP that is determined at compile time. Unlike a superscalar CPU, which discovers independent instructions in hardware at run time, a VLIW bundle explicitly packs a separate operation for each execution unit/pipeline, letting the compiler schedule the parallelism.
Branch prediction – In a pipelined architecture, one of the big issues is branching: how does the CPU know which instructions will follow? In the case of a branch, the direction depends on the result of conditional logic that is elsewhere in the pipeline, and that result may not be known yet. If the pipeline is deep, many cycles are wasted flushing the pipeline when the wrong set of instructions was loaded.
Branch prediction hardware can be very simple or complex. One of the simplest techniques is to use branch history. I forget the exact statistic, but in general a branch usually goes the same way it went before. This is evident if you’ve ever coded for loops like this:
Code:
for(int i = 0; i < 100; i++)
{
//do something
}
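As a sketch of the simplest history-based scheme: predict that a branch will go the same way it went last time. The function below is illustrative, not how any particular CPU implements it. For a loop branch that is taken many times and then falls through once, this guesses wrong only at the ends:

```c
#include <assert.h>

/* Count the mispredictions of a "last outcome" (1-bit) predictor over a
   sequence of branch outcomes, where 1 = taken and 0 = not taken. */
unsigned mispredictions(const int *outcomes, unsigned n, int initial_guess)
{
    int predict = initial_guess;
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        if (predict != outcomes[i])
            misses++;
        predict = outcomes[i];  /* remember how the branch went this time */
    }
    return misses;
}
```

For the loop above, taken 99 times and then not taken once, such a predictor starting with a guess of “taken” mispredicts only the final exit iteration.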
Loop unrolling – The process of flattening out repeated code. At a high level, it is the difference between telling someone to hand over 100 apples one at a time, making 100 iterations of “handing over,” versus handing over 100 apples 5 at a time, taking only 20 iterations. In code it might look like:
Code:
//normal loop
for(int i = 0; i < 100; i++)
array[i] = array[i] + 1;
//unrolled loop (to a degree)
for(int i = 0; i < 100; i+= 5)
{
array[i] = array[i] + 1;
array[i + 1] = array[i + 1] + 1;
array[i + 2] = array[i + 2] + 1;
array[i + 3] = array[i + 3] + 1;
array[i + 4] = array[i + 4] + 1;
}
On a scalar processor this is a benefit because less branching needs to be done, but it requires the use of extra registers. This is one reason the SPEs have such a large register file: 128 registers, each 128 bits wide.
Motherboard – houses all components of a computer-based system. The CPU connects to it, RAM connects to it, and the video card (with its attached video memory) connects to it. It is important that a motherboard’s capabilities match or exceed those of the components put on it; otherwise the components will be stepped down to what the motherboard supports (in the cases where they are still compatible). For example, if you have fast DDR RAM but your motherboard doesn’t support it, connecting that RAM will not give you higher bandwidth. If your motherboard doesn’t support hyper-threading/multi-core CPUs, your BIOS and operating system will never see the extra logical processors and thus never use them.
Northbridge – part of a motherboard that houses the fast components of a computer system. Typically this is the RAM and graphics chip. There is a relatively large amount of bandwidth required from these components.
Southbridge – part of a motherboard that houses the slower components of a computer system. Typically these are I/O devices like optical drives, hard drives, USB, network devices, and other permanent storage devices.
Queue – a first-in, first-out (FIFO) data structure. It is basically the traditional concept of a line, where the order of exit matches the order of entry. Queues are often modified from this strict traditional sense to accommodate priority. A priority queue would be something like a hospital line, where fatally injured patients move to the front to be treated first while those with non-fatal injuries can wait and still be healed.
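A minimal FIFO sketch as a fixed-size ring buffer (the type names and capacity are arbitrary choices for illustration):

```c
#include <assert.h>

/* Minimal fixed-size FIFO queue implemented as a ring buffer. */
#define QCAP 8

typedef struct {
    int data[QCAP];
    unsigned head, tail, count;
} Queue;

void q_init(Queue *q) { q->head = q->tail = q->count = 0; }

int q_push(Queue *q, int v)            /* returns 0 if the queue is full */
{
    if (q->count == QCAP) return 0;
    q->data[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;    /* wrap around the fixed buffer */
    q->count++;
    return 1;
}

int q_pop(Queue *q, int *out)          /* returns 0 if the queue is empty */
{
    if (q->count == 0) return 0;
    *out = q->data[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}
```

Items always come out in the order they went in, which is the defining FIFO property; a priority queue would instead pick the highest-priority item at pop time.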
Read vs Write Priority – On a system with memory bandwidth limitations, if read and write bandwidth are not equal, read speeds will typically be greater than write speeds. The reason is that a write does not need to complete until the data is actually accessed again at some later point, which isn’t always immediate; write operations can wait in a queue and be processed when bandwidth is available. Reading, on the contrary, means the data is needed *now* for an operation, and the system might have to wait for it, causing a pipeline stall.
Order of Magnitude – Used to describe a scale of comparison. The scales are separated by an exponent, typically of 10. For example, numbers between 1 and 10 are in the magnitude of 10^1 (10 to the first power); numbers between 10 and 100 are in the magnitude of 10^2. The separation in orders of magnitude is the difference between the exponents of the two numbers’ scales. For example, 5 and 500 are two orders of magnitude apart, since 5 is of magnitude 1 (10^1) and 500 is of magnitude 3 (10^3): 3 minus 1 is 2, so two orders of magnitude of separation. Order-of-magnitude differences in performance are huge in the world of computing.
In the computer science world, the common base is 2, not 10. So when you hear this term used, it generally refers to base 10 unless the topic is computing. In this post, I am referring to base 2, since this is the world of computing.
Order of magnitude may also be used to refer to different common scales depending on the application. For example, seconds are a magnitude lower than minutes, minutes a magnitude lower than hours, and so forth on the scale of time. The same can be said for distance when you jump from inches to feet to miles, or from centimeters to decimeters to meters to decameters to kilometers. The important thing to know is that order of magnitude is almost always specific to the context in which it is used.
Dot product – A mathematical operation between two vectors that results in a scalar value (plain number). If this value is 0, the vectors are perpendicular, otherwise the larger the result of this operation, the more “parallel” the two vectors are – there is a maximum value of the result depending on the magnitude of the two vectors.
The dot product can be calculated in two ways. A and B represent vectors, and ~ represents the dot product operation. Both A and B have x, y, and z components, referenced as A(x, y, or z) and B(x, y, or z). Theta is the angle between A and B. Standard order of operations applies (multiply before adding). |A| represents the length (also called magnitude) of the vector A relative to the origin (0, 0, 0).
A~B = |A| * |B| * cos( theta )
Or
A ~ B = A(x) * B(x) + A(y) * B(y) + A(z) * B(z)
The former way of finding the dot product is not fast, considering it involves a trigonometric function, and calculating the magnitude of a vector involves other costly operations on computers that I won’t get into. The main point is to see that a dot product is nothing but a series of multiplies followed by adding the results together. This operation lends itself well to SIMD architectures.
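In code, the component-wise form really is just three multiplies and two adds (the Vec3 type here is a made-up helper for illustration):

```c
#include <assert.h>

/* Component-wise dot product of two 3D vectors: three multiplies and
   two adds. This is the form that maps so well onto SIMD hardware. */
typedef struct { float x, y, z; } Vec3;

float dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

For example, dotting the x and y axes gives 0, confirming that perpendicular vectors have a zero dot product.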
Dot products come up often in 3D game programming. One common application is lighting: a surface or vertex normal (a vector perpendicular to the surface, not a 3D position) is dotted with the light direction to apply realistic lighting models to surfaces. Normals can be used in other creative ways to accomplish other 3D effects too.
Game loop – The “loop” constitutes one frame of animation/action/calculations that a game performs to simulate a real-time experience. The speed of this loop is the frame rate the game runs at. It is much like how a movie is displayed – each frame is shown individually, one after another, to deliver the effect of motion. Similarly, each frame updates game objects and re-renders them in new positions according to new input, AI routines, or physics reactions. Traditional steps in a game loop are (loosely in order):
1. Get user input.
2. Update player avatar (who the player is controlling).
3. Update the game’s objects.
4. Check for collisions, apply physics, and react appropriately.
5. Render all of the objects in their new positions.
That is, of course, a stripped-down game loop. Sound could be initiated anywhere in steps 1-4 depending on what happens. Steps 3 and 4 are very interrelated, and even step 2 could be folded into them. By no means is this order strict, and parts of the loop often run concurrently throughout the frame to ensure the fastest completion.
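The steps above can be sketched as code. Everything here (the Player type, update_player, the collapsed update step) is hypothetical and for illustration only, not taken from any real engine:

```c
#include <assert.h>

/* A stripped-down game loop: read input, update the avatar, render. */
typedef struct { float x; float speed; } Player;

void update_player(Player *p, int input_dir)   /* input_dir: -1, 0, or +1 */
{
    p->x += p->speed * (float)input_dir;       /* step 2: update the avatar */
}

float run_frames(Player *p, const int *inputs, int frames)
{
    for (int i = 0; i < frames; i++) {
        int dir = inputs[i];                   /* step 1: get user input */
        update_player(p, dir);                 /* steps 2-4, collapsed here */
        /* step 5: rendering would happen here */
    }
    return p->x;
}
```

Each trip through the loop is one frame, so the rate at which `run_frames` iterates would be the game’s frame rate.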
Vertex – in the 3D graphics world, a vertex simply represents a single 3D point in space – x, y, and z. 3 vertices make up a triangle. 4 vertices make up a square (or two triangles). 3D worlds are made up of a large set of vertices to define primitives(basic 3D constructs) that are used to display objects on the screen.
Frame buffer – the memory that contains the pixel information of what can, or is going to be displayed on screen. A frame buffer is the result of rasterizing 3D information (converting to pixels). Pixel shaders operate on frame buffers.
Pixel shader – a program executed on the GPU that processes pixel information after the 3D world has been rasterized. A pixel shader could be responsible for post-processing effects such as making the whole screen red by simply giving every pixel more red than it started with. It could add rain effects by drawing pixel groups that simulate rain, or blur parts of an image for focus effects. The primary thing to remember is that it runs late in the 3D graphics pipeline: it takes a frame buffer as input and outputs a modified frame buffer.
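The “make the whole screen red” example can be sketched on the CPU. A real pixel shader runs per-pixel on the GPU; this is just the same logic applied to an 8-bit RGB frame buffer, with the function name invented for illustration:

```c
#include <assert.h>

/* Toy "pixel shader" pass on the CPU: add a red tint to every pixel of
   an 8-bit interleaved RGB buffer, clamping the red channel at 255. */
void red_tint(unsigned char *rgb, int pixel_count, unsigned char amount)
{
    for (int i = 0; i < pixel_count; i++) {
        int r = rgb[i * 3] + amount;            /* red is the first channel */
        rgb[i * 3] = (unsigned char)(r > 255 ? 255 : r);
    }
}
```

Note that every pixel gets the same independent operation, which is exactly the kind of work a GPU parallelizes across its shader units.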
Vertex shader – a program executed on the GPU that processes vertex information. A vertex shader is a prime candidate for cloth simulation, modifying the vertex position of each point on the cloth surface. It can also be used to simulate water surfaces.
Anti-aliasing – commonly abbreviated as “AA.” This is the process of getting rid of the jagged edges and other artifacts that occur in a rendered 3D image. Two approaches to accomplishing this are multi-sample anti-aliasing (MSAA) and full-scene anti-aliasing (FSAA).
MSAA takes the frame buffer and samples it a number of times, progressively producing a less aliased picture. The number of times the image is sampled is the multiplier in front of MSAA; in other words, 4xMSAA samples the screen 4 times, so the buffer has to be read, processed, and written back to memory 4 times before it is done.
FSAA renders the frame buffer at a much larger size than the display resolution, and the image is then down-sampled once to the display resolution. The multiplier in front of FSAA represents the size of the over-sample; in other words, 4xFSAA renders an image 4 times the size of the actual resolution, meaning the frame buffer takes 4 times the memory of the final display surface.
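The down-sample step of FSAA can be sketched as a simple box filter: average each 2x2 block of the oversized buffer into one output pixel. This sketch uses a grayscale buffer for brevity; a real implementation would filter each color channel, and the function name is invented for illustration:

```c
#include <assert.h>

/* Down-sample a grayscale buffer by averaging each 2x2 block into one
   output pixel (the resolve step of a 4x over-sampled render). */
void downsample_2x2(const unsigned char *src, int src_w, int src_h,
                    unsigned char *dst)
{
    int dst_w = src_w / 2;
    for (int y = 0; y < src_h / 2; y++) {
        for (int x = 0; x < dst_w; x++) {
            int sum = src[(2 * y) * src_w + 2 * x]
                    + src[(2 * y) * src_w + 2 * x + 1]
                    + src[(2 * y + 1) * src_w + 2 * x]
                    + src[(2 * y + 1) * src_w + 2 * x + 1];
            dst[y * dst_w + x] = (unsigned char)(sum / 4);  /* box average */
        }
    }
}
```

Averaging the four sub-samples is what smooths a hard black/white edge into an intermediate gray, which is the visual effect of anti-aliasing.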
VMX-128 – the SIMD processing unit and instruction set of the Xbox 360 CPU’s PowerPC cores. The underlying technology is called AltiVec by Motorola, which holds that trademark, so IBM refers to it as VMX; the extended Xbox 360 variant is called VMX-128 after its enlarged 128-entry register file.
ICT – Image Constraint Token – for HD-DVD and Blu-ray, if content providers choose to set this token on their media, full-resolution playback is only enabled over approved HDCP-protected connections. If at any point the signal passes through an unapproved link, the image is down-sampled to 540p.
References:
This is by no means a formal or all-inclusive list. Over time it is hard for me to remember whether I read something in an article, encyclopedia, or journal, or learned it in the classroom. I’m positive I have sources not listed here; I only added these because I either remember actually using something from them, or they were good articles that shared similar information with ones I did use. Thankfully this isn’t a dissertation or anything, so no one’s going to fry me for plagiarism.
SIMD architectures: Page 1 - SIMD Architecture article.
Main Page - Wikipedia, the free encyclopedia - Wikipedia used to gain a general understanding of “what things are” for many of the topics. For absolutely specific information on the hardware, I generally followed its references to other articles.
http://www.blachford.info/computer/Cell/Cell0_v2.html - Very good old article giving an in-depth look at the Cell processor. Some of the specifics in this article might be invalid for the PS3’s Cell configuration.
http://www.ati.com/developer/eg05-xe...gett-final.pdf - PDF warning, Light XBox360 GPU coverage. Offers mostly architectural overview.
Just like being there: Papers from the Fall Processor Forum 2005: Application-customized CPU design - Most technically detailed article I found on the Xbox360 CPU hardware.
http://www.hotchips.org/archives/hc1.../HC17.S8T4.pdf - a good PDF with system level diagrams of Xbox360 hardware.
Inside the Xbox 360, Part II: the Xenon CPU: Page 1 - Arstechnica article covering the Xbox360 CPU.
Details of ATI's Xbox 360 GPU unveiled - The Tech Report - Page 1 – good article covering the Xbox360 GPU.
GamePC - Custom Build Computers, High Performance PC, Custom Computer, Custom Desktop PC – article with some nice coverage on XDR RAM.
AnandTech: Microsoft's Xbox 360, Sony's PS3 - A Hardware Discussion – Pretty good AnandTech article covering an overall comparison between the Playstation 3 and Xbox360. It is one of those “we’ll make them equal” type comparisons though.
AnandTech: Hardware Behind the Consoles - Part I: Microsoft's Xbox – good AnandTech article covering the hardware between PS2 and Xbox last generation.
Cell Broadband Engine Architecture and its first implementation - technically detailed IBM resource for the first implementation of the Cell processor. Includes some very good detail on what kind of performance you can get from the Cell, and in what situations.
Introducing the IBM/Sony/Toshiba Cell Processor — Part I: the SIMD processing units: Page 1 - pretty good Arstechnica article on the Cell processor – part I.
Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture: Page 1 - pretty good Arstechnica article on the Cell processor – part II.
Real World Technologies - ISSCC 2005: The CELL Microprocessor – Real World Technologies article on the Cell processor.
AnandTech: Understanding the Cell Microprocessor – Good Anandtech article on the Cell processor.
IBM JRD 49-4/5 | Introduction to the Cell multiprocessor - Great IBM article giving a good introduction to the Cell.
The Cell project at IBM Research - good site covering the Cell project and various aspects and design goals of it.
Cell BE Handbook v1.0 (May 2006) – I used this mostly to just read up on some of the instructions on the SPEs and insight to some of the problems and applications of SIMD.
Power.org - Cell Developer Corner - This site has presentations that can pretty much teach you all about the Cell in more detail than what I covered here. It’s a very good resource that developers (especially team leads and software architects) working on Playstation 3 projects should know about.