How a modern CPU works: A tale in three acts #456328
11/18/15 19:07
Joined: Apr 2007
Posts: 3,751
Canada
WretchedSid Offline OP
Expert
Alright, this is something that I have wanted to write for quite some time now, mostly to establish nerd cred. It’s about the inner workings of a modern CPU, which I find an incredibly interesting topic, and knowing it helps with writing performant code. I’ll be focusing on Intel and the x86 architecture, because that is what I mainly know about. CPUs are described on two levels: the architecture, in this case x86 (i.e. each CPU can execute the x86 instruction set), and the microarchitecture, which is the actual implementation of how the CPU goes about executing that architecture. Intel follows the tick-tock model: one generation they come out with a new microarchitecture that greatly changes how the CPU performs, and the next generation is practically the same design but on a smaller process (from 45nm to 32nm, for example). Of course, I don’t want to spark a race of premature optimisation, but knowing this stuff helps you design code from the get-go to be at least not inefficient. That said, modern CPUs are massive beasts of performance, and you can do things inefficiently and still get away with it.

General overview

The reason you can get away with it is that CPUs are complex as hell. Supposedly there used to be a time when a CPU would read an instruction and then execute it, all in the same clock cycle, i.e. the throughput of your instructions was directly tied to the clock frequency of the CPU. This hasn’t been the case in a very, very long time. Nowadays CPUs are pipelined, meaning that the whole business of reading an instruction and then doing something with it is broken up into small stages, and every stage of the pipeline does its work on every clock cycle. A simple pipeline would be: fetch an instruction, decode the instruction, then execute the instruction. Three stages in total, so an instruction takes at least three clock cycles to complete: first clock cycle it’s fetched, second clock cycle it’s decoded, and third clock cycle it’s actually executed. The upside is that the CPU can now work on three instructions at once.

The reason for pipelining is that it keeps the circuitry per stage simple, so less space is needed. It also allows running the CPU at higher clock rates: a new clock cycle can only begin once the signals of the previous cycle have propagated through the circuit, and the longer the circuit, the longer the signal takes. If you keep the whole thing divided into small steps, each clock cycle can be a lot shorter and a new cycle can begin much sooner.

Now, a modern CPU has way more than just three pipeline stages. The Core 2 microarchitecture has a 14-stage pipeline. Long pipelines are undesirable because there is a huge penalty whenever the pipeline has to be rolled back. The Pentium 4 (NetBurst microarchitecture) had a pipeline of up to 31 stages, which allowed for insanely high clock frequencies at the time, but at the cost of having to toss up to 31 instructions away whenever the pipeline had to be rolled back (why that might have to be done, I’ll get to in a minute). Long story short, NetBurst was a flop in terms of achieving high performance; however, it turned out to be great at generating a ton of heat.

Another thing is that modern CPUs no longer execute instructions directly, but instead translate them first. x86 is the big CISC architecture out there, CISC standing for Complex Instruction Set Computing: the instruction set doesn’t just offer a set of basic instructions from which complex behaviour has to be composed (by the programmer or compiler); instead, x86 has instructions for virtually everything. In fact, some instructions have so many nuances and so much “power” that they are Turing complete by themselves, i.e. a compiler could compile code using only one instruction! Of course, that would be quite bad for performance, but hey, it gives you a nice idea of just how complex the instruction set really is. Anyway, as it turns out, that’s a rather stupid design if you want to keep your circuitry simple. So a modern x86 CPU will translate instructions into so-called micro-ops. For example, in x86 you can add a value from memory to a register. Executing this directly would be rather complex, because the execution engine for integer math would suddenly also have to be able to load from memory. So instead, the CPU goes ahead and translates it into two micro-ops: one to fetch the value from memory and store it somewhere temporarily, and a second one that adds that value to the register.

So, the CPU will fetch an instruction and then decode it into micro-ops, which are then executed. Great, except the CPU needs to know what to fetch in the first place! Imagine a 14-stage pipeline, where the CPU decodes a conditional jump that depends on the value produced by an instruction currently at stage 9 of the pipeline. This could be something as simple as an if based on a computed value: the CPU doesn’t know yet whether the jump is taken, because the producing instruction still has 5 stages left before its value is available. Of course, you could just halt everything and wait for that instruction to complete, but that is exactly the opposite of what you want! CPUs need high throughput, very, very high throughput. There is a physical limit to clock frequency, not to mention an economic one, and the way to overcome this is to keep the frequency the same but increase the throughput of the CPU. So stalling the pipeline is not an option. Instead, the CPU has a so-called branch predictor, whose sole job is to predict whether a branch is taken or not. The branch predictor tells the CPU which instruction to fetch next, based on voodoo and tarot cards. Not really, but predicting the future is hard. If a branch is taken randomly, the CPU will have a very hard time predicting it; usually, however, programs have certain patterns to them, which the branch predictor can detect and use to make a guesstimate. For example, you may have a loop whose condition makes it repeat 100 times, maybe because you are iterating over an array. The branch in this case is at the end of the loop, where it’s checked whether the loop is complete or should repeat, and the branch predictor will soon detect the pattern that the loop is repeated over and over again, so it will predict the branch to be taken.
Of course, this prediction will fail once the condition stops being true, but in the best case you end up with the branch predictor being right 99 times and wrong once! This is great because it keeps the pipeline highly saturated. So what happens when the branch predictor is wrong? Well, the CPU went ahead and began executing code that turned out to be the wrong path, which means the pipeline has to be rolled back and the CPU has to start over. This is expensive: worst case, you just emptied out 13 stages of your pipeline, and it will take another 14 clock cycles until the next instruction completes. It’s a huge penalty to pay, which is why it is always good to play nice with the CPU and help it out by creating predictable branches.

Now, what is a partially executed instruction anyway? It’s nothing: the CPU doesn’t have to perform any work to actually roll back the instructions themselves, because they were never committed. Sure, the computation was done for nothing and the pipeline is empty, but the instructions didn’t have any side effects yet. The last stage in the pipeline is the retirement station, where completed instructions are committed: their values are actually written and their side effects, e.g. changes to the flags register, are performed. This is important because otherwise the CPU would have to do a huge amount of work when rolling back the pipeline to reverse the work already done, which sometimes isn’t even possible! Zeroing out a register, for example, is not reversible unless you keep the previous value somewhere just in case.

Another thing the retirement station does is retire instructions in order. Imagine the following: you have an instruction that performs some calculation with the EAX and EDI registers, followed by one that simply zeroes out the ECX register. Let’s say the first one does a multiplication, which actually takes 4 clock cycles to complete in the ALU (arithmetic logic unit, the circuit that actually does the calculations). Great, now what, you just start one instruction and wait for it to complete before starting the next one? 4 clock cycles is a long time to wait, especially because the CPU could technically zero out the register already, since it has a different execution unit available for that. So that’s what the CPU does: it already starts the next instruction, the one that zeroes out the register, which might complete in a single clock cycle. Now the second instruction makes it to the retirement station before the first one, because the CPU performed work out of order. The retirement station will put the second instruction on the bench until it’s ready to retire it, which is when all previous instructions have retired. And because the engineers were smart, they made the retirement station able to retire multiple instructions at once, so once the time has come, both instructions will retire at the same time.

Obviously that was a rather simple example, in reality you have a lot of different instructions taking a different amount of time to actually execute, and the CPU will start to perform those instructions out of order to keep the throughput high. As long as instructions don’t depend on each other, the CPU is able to perform them out of order. Of course if you end up with a long chain of instructions that all depend on the previous one, the CPU won’t be able to pull this magic off.

Also, the CPU has multiple execution ports! The execution ports are where instructions are actually executed, and the CPU has ports for working with memory, integer arithmetic, floating point arithmetic, SIMD instructions, etc. Some of them are even duplicated, or at least partially duplicated: you will usually find multiple ALUs, although not all of them can perform all instructions. The CPU will try to feed as many instructions to as many ports as it can, so it can do as much work as possible at once. This also means that the earlier pipeline stages, the ones that prepare instructions for execution, have to handle multiple instructions at once. The CPU always fetches and decodes multiple instructions at once, and if it can, it will also feed multiple instructions at once to as many execution ports as possible. Afterwards, it will retire as many instructions as it can, which is how well optimised code can achieve a throughput of up to 4 instructions per clock cycle on modern Intel CPUs.


Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com
Re: How a modern CPU works: A tale in three acts [Re: WretchedSid] #456329
11/18/15 19:08
Joined: Apr 2007
Posts: 3,751
Canada
WretchedSid Offline OP
Expert
Alright, so far, so good. We are able to pump a lot of instructions through the CPU in very few clock cycles; however, there are two big issues. The first one is that RAM is slow. Like, really slow: it can take upwards of 100+ clock cycles for a RAM access to complete, which is obviously not ideal, because you will probably end up stalling your neat little pipeline. Luckily, just like there are branch predictors, there are prefetchers that try to figure out what memory to fetch ahead of time and make available. The CPU also has various caches that are very fast, and the results of RAM fetches are stored there temporarily, so the next time they are accessed it won’t take long to retrieve them. Of course, the space in these caches is limited, so the CPU has to be smart about what it keeps in there and what it discards, so that in an ideal world it never has to stall because of memory access. Also, usually a memory access is followed by another memory access near the one you just made. For example, when reading values from a struct, they are all close to each other, and most likely you will read several of them. This is why the prefetcher always fetches a whole cache line, which is usually 64 bytes in size. Even if the memory access is only 1 byte, more memory is always fetched “just in case”.

The second problem is that 32-bit x86 only has 8 registers, and most instructions will read or write at least one of them, so there is a bit of contention. That’s why modern CPUs have hundreds of physical registers and a register renamer! Instead of EAX being permanently associated with one physical register, the CPU just associates it with some register and can freely change which physical register it means when talking about EAX. Of course, since all calculations still have to appear as if EAX were one register, it can’t switch willy-nilly, but it can break up dependency chains very easily that way. For example, imagine a chain of instructions doing some calculations that uses EAX as temporary storage, followed by another chain which first zeroes EAX out and then does some calculations of its own. The CPU can go ahead and change which physical register the second chain refers to, and both chains can run in parallel. Even better, the register renamer can handle the zeroing itself, so zeroing out EAX effectively becomes a no-op.

Wrap up for now

So, let’s see... A modern CPU pipeline looks something like: Branch predictor -> Instruction fetcher -> Instruction decoder -> Re-order buffer -> Register renamer -> Execution engine -> Retirement station. That’s 7 stages, not quite the 14 we were told the CPU would have, and in fact does have, but it ought to be enough for a very basic overview, just to get a rough idea of what the CPU does. Anyone still with me so far? Questions? Anything unclear? I’ll write a follow-up that moves away from this generic overview and actually takes a look at a real microarchitecture. Interest?

Edit: Wrong forum, eh? Can a mod move this somewhere more appropriate?

Last edited by WretchedSid; 11/18/15 19:09.

Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com
Re: How a modern CPU works: A tale in three acts [Re: WretchedSid] #456330
11/19/15 02:00

Malice
Unregistered
Happy accident. Thanks, very enlightening!

Re: How a modern CPU works: A tale in three acts [Re: ] #456336
11/19/15 08:24
Joined: Mar 2011
Posts: 3,150
Budapest
sivan Offline
Expert
for the lazy guys, in short:
in fact, tiny squirrels are running around inside CPUs.


Free world editor for 3D Gamestudio: MapBuilder Editor
Re: How a modern CPU works: A tale in three acts [Re: sivan] #456338
11/19/15 08:31
Joined: Sep 2003
Posts: 6,861
Kiel (Germany)
Superku Offline
Senior Expert
This is going to take me at least one hour to read (can't imagine how long it took to write, even longer than Harry Potter posts) but I bet it's worth it. But... not now, some other day!
Thanks already for the insight though.


"Falls das Resultat nicht einfach nur dermassen gut aussieht, sollten Sie nochmal von vorn anfangen..." - Manual

Check out my new game: Pogostuck: Rage With Your Friends
Re: How a modern CPU works: A tale in three acts [Re: Superku] #456350
11/19/15 16:46
Joined: Apr 2007
Posts: 3,751
Canada
WretchedSid Offline OP
Expert
Hey man, it's multiple books of 800+ pages each condensed down to 2465 words. Be glad I ain't making you read those instead (although I should probably link them for further reading). Anyway, I realize it's a bit of a dry topic, but I just wanted to see if anyone even has any interest in this and if I should write more about it. I'm kind of hoping for constructive criticism so I can save the money for an editor and write a long blogpost about it. Thanks for being my guinea pigs laugh

Edit: Also, according to the program I used to write this, it should take you 12 minutes and 19 seconds to read. Get going!

Last edited by WretchedSid; 11/19/15 16:48.

Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com
Re: How a modern CPU works: A tale in three acts [Re: WretchedSid] #456353
11/19/15 17:11
Joined: Jun 2007
Posts: 1,337
Hiporope and its pain
txesmi Offline
Serious User
This weekend I will take time to read it for sure.

Thank you for your effort!

Re: How a modern CPU works: A tale in three acts [Re: txesmi] #456354
11/19/15 17:19
Joined: Mar 2012
Posts: 927
cyberspace
Wjbender Offline
User
I read it after you posted it. It is interesting, but the topic doesn't really grab me, so I know next to nothing about it except that it does what it was designed to do.

We will see how long it lasts; with progression comes change.


Compulsive compiler
