Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

GCN Assembler for AMD GPUs

0.00/5 (No votes)
17 Apr 2016 1  
an assembler/compiler for AMD’s GCN (Generation Core Next Architecture) Assembly Language

Download Notes: This project requires specific hardware. (see requirements for details)

 


Contents

Introduction
Background

Asm4GCN Assembler Sub-Project
   • Labels
   • Variables
         ◦ Declaration (Updated)
         ◦ Inline Declarations 
         ◦ Using a Variable
         ◦ Renaming
         ◦ Variable Indexing (Updated)
         ◦ Free a Variable (Updated)
         ◦ Forcing Register Number
    • Flexible Constants 
    • #define Support
    • #s_pool / #v_pool Command
    • Multiple Statements Per Line & Semicolons 
    • Project Files
    • Point of Interest
         ◦ Inline Assembly 
         ◦ Smart Register Packing
OpenCLwithGCN Sub-Project
    • Features 
         ◦ Built-in Text Template Engine
         ◦ Mixing OpenCL & Asm4GCN Kernels
    • Using OpenCLwithGCN
    • OpenCLwithGCN Project Files
Asm4GcnGUI Windows Interface Sub-Project
    • Features of the GUI Interface
General Topics on GPU Assembly
    • ​GPU Assembly Advantages/Disadvantages
    • Preloaded Register Values
    • GCN Assembly Code Writing Tips
Limitations
Future Wish List
Videos
System Requirements
History

Other GCN Assemblers
Thanks To...


Introduction

This GCN GPU assembly language compiler converts human-readable assembly to binary machine code for AMD GCN equipped GPU processors. Assembly language is a friendly abstract way to view machine code. With some exceptions, assembly statements are map directly to individual machine instructions. Applications that are time critical or need to use special hardware features often use Assembly. An example use of GPU assembly might be for a bit-coin mining application. Bit-coin mining is compute intensive and benefits from small performance gains. Bit-coin mining using assembly can also take advantage of special hardware instructions or instruction features that otherwise could not be accessed. Furthermore, writing in assembly can often be used to combine instructions in creative ways. Of course, writing assembly does have its drawbacks in the areas of maintainability and compatibility – see the disadvantages section for details.

Besides the core assembler, this project includes two additional Visual Studio projects. Each project wraps another in a shell like way. The inner core is the Asm4GCN assembler project. Wrapping that is the OpenCLwithGCN project that injects assembly binaries into dummy OpenCL kernels. Finally, there is the Asm4GCNGUI project, the outer most layer. This final windows application pulls everything together in a friendly syntax-highlighting editor. This editor also doubles as an example of using OpenCLwithGCN.

Here is a quick introduction of each project starting with the assembler core and working out to the user interface:

Asm4GCN Assembler – The core is the Asm4GCN assembler/compiler itself. It converts assembly statement blocks into their binary representation. It handles variables, predefinitions, labels, and register management. It returns warnings and errors to the user.

OpenCLwithGCN – This takes an OpenCL program that has a mix of OpenCL and Asm4GCN kernels and then inserts (or patches-in) the binaries from the gncAsm kernels. It also runs the program.

Asm4GCNGUI – This simple windows application implements the Asm4GCN for writing code in a small editor. The editor’s syntax highlighting and code completion make it handy for writing GCN assembly. The editor allows for quick testing of Asm4GCN kernels and maybe useful for testing or educational purposes. Just write some assembly and OpenCL kernels with C# based host code and see to what happens.

Background

In times past, I wanted to take advantage of the extra performance benefits of writing in assembly as well as take advantage of special GPU hardware features that were only available through assembly. To satisfy this for NVidia GPUs I can inline some PTX using Cuda’s asm function. PTX is not true assembly but it does provide a near like experience. For AMD on the other hand, there is not really a straightforward way to "inline" assembly in OpenCL.

The optimal solution would be to have an asm function in OpenCL however, this is not currently supported… at least not for the most part. There have been a couple posts online where users have partially got inline asm to work however there is one major problem– variables cannot be passed in/out. If variables cannot be passed in then an inline assembly is not so useful. Furthermore, any registers that are used might stomp on some in-use OpenCL variables.

Since there is nothing that I knew of that supports inline GCN assembly in OpenCL, I decided to attempt it myself. My original intentions were to have inline assembly however, it did not pan out as well as I liked – at least not yet. The complexity of creating a dynamically sized dummy kernel and then capturing the used registers is complicated and my attempts were not reliable. After much effort, the original goal of inline assembly has been put on the back burner, and it was built so a cl::program can have a mix of OpenCL __kernel and GCN __asm4GCN kernels. The framework for doing inline assembly is half way in-place but I am still struggling with capturing the registers used for the dynamically sized dummy kernels. I am not sure if inline assembly will be added in the future. (see Inline assembly for more)

One note I should add here is there is currently a windows app, called HetPas, which generates elf-image files that can be loaded into an OpenCL program. This great app, created by Realhet, is a closed source windows project that takes GCN assembly kernels and Pascal-like host code and then runs them – similar to this project. I have been using HetPas for a while and it works well. The only part I struggled with is the Pascal side of things.

Asm4GCN Assembler Sub-Project

This project comprises of three Visual Studio projects. The Asm4GCN assembler, OpenCLwithGCN, and the Asm4GcnGUI. The Asm4GCN project is the core assembler for this project and converts a block of assembly statements into raw binary code.

For Example: [s_mov_b32 v1 s0] is converted into a hardware readable [7E020200]

Besides converting each statement to its hardware code counterpart, the Assembler also handles other items like labels and jump statements. I think the best way to show what the Assembler does through flowcharts. For simplicity, some steps have been left out.

  1. A block of GCN assembly is passed into the Asm4GCN assembler to be converted into binary.
  2. The first (or next) line is started.
  3. Remove any whitespace on this line. Comments, empty lines, and extra whitespace are removed.
  4. If the row starts with a label header (i.e. myLabel: ) it is recorded in the labels list here.
  5. #s/v_pool and #define
  6. Processes any #defines labels. For each match the keyword is replaced.
  7. Spit up line into multiple statements using “;”.
  8. Are there any (more) statements on this line? If Yes then let's parse the statement, if No then continue to next line.
  9. The statement is split up into a string array. V_Mov v2,v4 is translated into {“V_Mov”, “v2”, “v4”}
  10. Process any variable initialization or destruction. Here registers are reserved or returned back to the pool using free.
  11. If the current instruction uses any variables, then they are replaced with their current register number.
  12. Convert hex, binary, octal, and scientific notation strings into constants.
  13. If the current instruction references any labels, then we will hold off on converting the instruction to binary since we do not know the constant value.
  14. The current instruction is converted into its binary version.
  15. The current instruction is added to a list for later processing. 

 

The next step is to fill the stmt.opSize for any statements that do not have it filled in yet. After this is done, some final cleanup and then a binary is created.

16. Creates a list of Instructions with an unresolved size (opSize). These are instructions with labels where the distance is not known.
17. Processes each instruction one by one until we get to the end of the list.
18. If the min and max distance to the label are the same then the instruction size must be same.
19. Set the byte size to whatever the equal min/max was and then continue to the next instruction.
20. Is the byte size still 4 bytes even with the farthest jump distance? If so then it will always be 4 bytes.
21. Set OpSize to 4 bytes
22. We have finished enumerating the list. Are there any unresolved OpSizes? If so then we will need to enumerate again.
23. Convert any final instructions that had labels into binary form.
24. Join all the 4 and 8 byte instructions into one single byte array.
25. Return the finished binary. Other information is also returned like register usage as well.

Labels

One of the assembler’s main tasks is calculating the distance of jump labels and converting them into actual numbers. Doing this is one of the more complicated parts of GCN assembly - at least to do it correctly. It turns out that processing label distances, is much like the “chicken or the egg” dilemma. Instructions that have labels in them could be either 32-bit or 64-bit depending on the jump length but we do not know the jump length until we know the sizes of the instructions in-between the jump and label. When there are no labels between a jump and its destination life is easy, however as soon as a jump passes over other jumps we start to have problems because we do not know the exact jump distance any longer. 

To resolve this, each jump statement is tried with the shortest and longest possible distances. The longest distance is the sum of the sizes using a 64-bit size for the unresolved instructions. The shortest distance would be the same but using 32-bit for the unresolved. At any time, if the shortest and longest both result in a 32-bit or 64-bit instruction, then we can lock in that size. We do this with each unresolved statement repeatedly until all the jump statements have a fixed size. After we know the size of each instruction, then the labels can be replaced with the appropriate jump-length constant. There are a couple more steps but that is the main idea.

As an alternative to the above, one easy, but not as efficient way to resolve this is to just use the 64-bit version of the instruction when we are unsure if the distance will fit in a 32-bit instruction. A 64-bit branch instruction can jump distance up to +/- 8TB!

Some features of the label system:

  • Labels can be used in place of most constants. A label is a constant as well. It is just the distance to another instruction so we can use a label wherever a constant would normally be used. For example, a statement like s_mov s2, MyLabel would simply move the distance from MyLabel into s2. Just remember s_sub s4, MyLabelA, MyLabelB would fail in this case because s_sub cannot contain two constants.
  • Multiple labels with the same name are supported. For myself, I often want to re-use a label name or I run out of my favorite jump labels. When using multiple labels with the same name, the assembler will use the nearest matching label. Label re-use can be handy when copying and pasting blocks of code as there is no requirement to update the label names ... in most cases. The compiler will throw a warning though as label name reuse can sometimes result in unintended jumps. 

Variables

Writing an assembly block with more than 10 to 20 statements gets complicated quickly. Some of the reasons for the complexity is trying to remember what each register does and whether it is free or not. The ability to checkout a register with a friendly name is extremely useful. For example, a variable named LoopCount is much more descriptive then S23:24. In addition, the ability to checkout a register keeps a programmer from accidently over-writing an existing register. When the number of free registers becomes low, it can be time consuming to try and hunt down a free register. Variables abstract the programmer from having to remember what each register number does and what registers are available for use.

Example:

v4i myVar1, myVar2, mySum;
v_mov_b32 myVar1, 10;
v_mov_b32 myVar2, 20;
v_add_i32 mySum, myVar1, myVar2
free myVar1, myVar2

The variable system for Asm4GCN has three parts – the declaration, using the variable, and freeing the variable.

Declaring a Variable

When declaring a variable, use a variable type declaration keyword such as v4i, v4f, v8f, s8f, etc. When the compiler runs across a declaration it will immediately reserve the first free register(s) from the allowed register pool. Instead of using a long declaration like scalar unsigned int or vector long long, I tried to simplify it by using a short 3-digit declaration. The first digit determines vector or scalar. The second digit is the size in bytes. And the last digit is the intended data-type of the variable. Following the type is the variable name itself. This first character must be a letter or underscore and the remaining characters can contain numbers as well.

Example declarations:

 v4i _my_int_vector;
 v8u myUnsignedLongVector;
 s8f myDoubleScalar1, myDoubleScalar2; //multiple items can be declared at once
 s4u myAddrForcedToS10 s10;     // forced to 10 (used for incoming params)
 v8f myDoubleForcedTo2 v[2:3];  // either v[2:3] or just the beginning, v2, can be specified

The declaration starts out with a type followed by one or more variable names. I wanted to keep the variable types short and simple so I came up with a three character variable type that holds type, size and datatype.

First Digit 

The first digit specifies the memory space this variable will live in. This is either S for scalar or V for vector memory. This tells the register reservation system to reserve either a Scalar or Vector register.

Second Digit (Updated!)

The middle digit(s) specify the size in bytes of the variable. The size tells the compiler to reserve 1 or more consecutive registers to hold the variable as well as how to align them in memory. A size value of 2 would represent 2 bytes but would take up a full DWORD sized register. A value of 4 would also use one register. Currently there is no difference in what the compiler does for 1, 2, or 4 – this is just informational. However, if 8 is used, this does affect the compiler in that it will reserve two consecutive registers. Multi-register variables are also aligned - a two-register variable would contain an even numbered register followed by an odd numbered register. If 16 is used this would use 4 registers and be aligned by 4. Anything larger than 16 is still always aligned by 4.

Alignment and Registers used by Size
  Regs Used Alignment
1 byte 1 Register 1 - any register
2 byte 1 Register 1 - any register
4 byte 1 Register 1 - any register
8 byte 2 Registers 2 - must begin on reg divisible by 2
16 byte 4 Registers 4 - must begin on reg divisible by 4
32 byte 8 Registers 4 - must begin on reg divisible by 4
Last Digit

The final digit is the intended datatype. This is either F for float, I for integer, U for an unsigned integer, or B for bits. The “bits” format is really a ketch-all format. Bits can really represent any other format including bool. Unfortunately, this last format specifier is not used by the compiler yet but it still provides useful information for the programmer.

My original intentions were to simplify GCN assembly by having the ability to do simple commands such as R2 = R3 + R4 where type information would be needed. Also types could be used to throw warnings. If a float type is specified but used with a v_add_i32 then this could throw a warning. These extra features may or may not be added in the future.

Forcing Register Numbers

When declaring a register, it can optionally be forced to a specific register. It can be assigned a physical register number or it can reuse a register from a past variable. 

The two methods to hardcode a register numbers:

  1. By using a fixed register number. (i.e. v4u myLaneID v0 )
  2. Or, by using a past variable name. (i.e. v4u myNewVar somePastVariable )

A fixed register number is typically used to capture pre-assigned registers when a kernel lunches. These can be items like parameter addresses, block id/size, or lane id. Variables with a fixed register should be used near the beginning of a kernel. If it is used later in the code, it could fail because a variable could have automatically been assigned that register.

The other method of assigning registers is to copy the register number from a past variable. This is useful in some situations.

Sometimes the meaning a register changes and the variable name is no longer accurate. One way to handle this is to "rename" the existing variable however this can be ugly. This was previously implemented but has since been removed. A better way is to free a variable and then reuse the same register in a declaration. This results in cleaner code then the awkward rename statement. It also provides more flexibility. 

When creating a new variable, the programmer might want to use only part of an existing register. For example, let us say that there is a variable that eats up 4 registers but for some reason there is only a need to preserve the last register. Something like this could be done: s4u myNewVar myPast16SizedVar[3]; Now the first three registers of myPast16SizedVar are free and the last register can stick around with a different name. The question might come up of why not just use s4u myNewVar; s_mov_b32 s4u myNewVar myPast16SizedVar[3]. Two issues: (1) there would be a 5 register peak usage and (2) s_mov_b32 is a hardware instruction so that would result in larger (and slower) code. The earlier register reuse method takes no instructions - it is just better register management. 

Inline Variable Declarations

When programming, especially in assembly, register declarations can take up a lot of space. I was trying to think of a way to solve this problem and recalled the C-style way of doing this. In C it can be done like int x = y + z; but after juggling things around I came up with a format like v_add_i32 int x, y, z

Let demonstrate it in a real-world example:

First, without inline declarations...

v4u vLocalSize
v_mov_b32     vLocalSize, localSize
v_mul_i32_i24 vLocalSize, groupId, vLocalSize 
v4u localSizeIdx
v_add_i32     localSizeIdx, vcc, laneId, vLocalSize 
v4u vGlovalID
v_add_i32     vGlobalID, vcc, baseGlobalId, localSizeIdx
v4u vGlobalOffset
v_lshlrev_b32 vGlobalOffset, 2, vGlobalID

And now with inline declarations...

v_mov_b32     v4u vLocalSize, localSize
v_mul_i32_i24     vLocalSize, groupId, vLocalSize
v_add_i32     v4u localSizeIdx, vcc, laneId, vLocalSize 
v_add_i32     v4u vGlobalID, vcc, baseGlobalId, localSizeIdx
v_lshlrev_b32 v4u vGlobalOffset, 2, vGlobalID

What is cleaner? I like the later of the two.

But wait, there is more. Since registers can be are declared and automatically freed on the same instruction the same register can be recycled. This can result in better performing code by reducing the number of registers. For Example, in the above code, v_add_i32 v4u vGlobalID, vcc, baseGlobalId, localSizeIdx can free localSizeIdx and then reuse that same register on the same line. It might be translated into something like v_add_i32 v2, vcc, s8, v2. Notice that v2 is being reused in the same instruction. 

Using a Variable

Just use a variable name as needed in your code instead of using register numbers (i.e. v17 or s[20:21]). Under the covers, whenever the assembler runs across a variable in the assembly it will do a dictionary lookup to retrieve its register number and type. It will then replace the variable name text with the register type and number. i.e. v_add_i32 v3, v4, myInt; to v_add_i32 v3, v4, v7;

Variable Indexing

Variable indexing adds the ability to access a particular register in a multi-register variable. For example, with s8b myVar there might be a need to access the second register. This can be achieved by appending a [1] to the variable.

An example of when this might be needed is when adding two 64-bit values:

v8i myInt1, myInt2; // to be added together
[ myInt1 and myInt2 are assigned values here ]
v_add_i32 myInt1[0], vcc, myInt1[0], myInt2[0]  // note: "[0]" is actually redundant on this line
v_add_i32 myInt1[1], vcc, myInt1[1], myInt2[1]

Free a Variable

A variable's registers can be freed by either automatic freeing or by using the free keyword. In the background, these mark the variable as finished and mark the register number(s) as free. In most cases, automatic freeing will suffice.

Automatic Freeing

A register is freed automatically on the last statement it was used on. The compiler records all the positions where the variable is used and will automatically free that register on the statement it was last used. Automatic freeing also can recycle registers in the same statement so it is more efficient as well. 

Manual Freeing

Use the free keyword followed by a variable name to manually free a variable.

v4u myVar1, myVar2
...
v_add_u32 myVar1, myVar1, myVar2 <-- myVar1 and myVar2 last used (myVar2 auto freed here) 
..
free myVar1 <-- myVar1 is freed here

The free can be followed by one or more variable names separated by commas. i.e. free myVar1, myVar2

This will force the register(s) that are assigned to them to be freed at the statements location. Free is used to keep variables around longer than they normally would with automatic freeing. There are a couple instances where this can be useful.

One reason might be when accessing registers using GPR Indexing with v_movrels or v_movreld. Since using a v_movrel instruction does not use the register number directly the automatic freeing does not know that is being used later in the code and will free it early.

A second reason to keep variables around longer is that there are instances when we might jump back into some code and expect a variable to be active still. If a variable is automatically deleted because it is last used on line 20, and the register is recycled by a new variable on line 21 then later we jump back to line 18 then those registers would have been wiped out. If we add a free statement later in the code, then the compiler will keep that variable and its register(s) around longer. 

Flexible Constants

Inline constant, also known as literals, can be used with instructions that support them. Asm4GCN supports a large variety of constant types: Decimal (250), Hex (0xFA), Octal (0x 372), Binary (0b11111010), exponent (25E1, 25E+1), and Label (myLabel). You are probably thinking why is label listed? Well in AsmGCN a label is just the distance to that label in bytes so it can be used in place of any constant.

Valid Examples:

s_add_i32    s3, s4, 12             // pos int
s_add_i32    s3, s4, -12            // Small neg.
s_min_u32    s5, s6, 0xabcd         // Hex
s_min_u32    s5, s6, 10e2           // Exp
s_min_u32    s5, s6, -10e2          // Neg Exp
s_mov_b32    s4, 2.                 // float 
s_mov_b32    s4, -20.0              // neg float 
s_mov_b32    s4, .5                 // float 
s_mov_b32    s4, -.5                // float 
s_mov_b32    s4, 343.432            // float 
s_mov_b32    s4, 3.4e4              // exp float 
s_mov_b32    s4, -34.4e-4           // exp float 
s_mov_b32    s4, 0o7654             // Octal  (4000/FA0)
s_mov_b32    s4, 0b0011111111       // Binary (255/FF)

#define Support

The assembler supports c-style #define statements. This is done by a simple find-and-replace for each #define in the code block. For example, #define _fe_ 54321 would replace all the _fe_ in the code with 54321. The reason for the surrounding underscores is to prevent accidental usage. If we just used fe we could accidently replace any fe with a 12345 so v_bfe_u32 would become v_b12345_u32. #defines without underscores can still be used however, they will generate a compiler warning.

#defines with parameters are also supported, so something like #define _world_(AAA) 1AAA1 would search for items like int myNum = _world_(00); and translate them into int myNum = 1001;.

Valid Examples:

#define _hw3_ Hello World!
<span style="font-size: 9pt;">#define _hw1_(opt0) Hello opt0 World!
</span><span style="font-size: 9pt;">#define _hw2_(opt0,opt1) Hello opt0 World from opt1!</span>

#s_pool and #v_pool

Most of the time you might want to use registers 0 through 255 however sometimes there might be a need to compile using a particular set (or pool) of registers. #s_pool and #v_pool can be placed at the top of the assembler text to do this. When this compiler command is hit the register reservation system will use these registers for variables. Normally this would be put in the top of your code but you can actually use it anywhere – a warning will appear if it is not near the top.

Valid Examples:

#S_POOL s22, s23, s24, s27, s29, s30, s31, s33, s34, s35, s36, s37
#V_POOL v11, v12, v13, v14, v15, v17, v19, v20, v21, v23, v24, v25

Multiple Statements per Line and Semicolons 

In most cases, multiple statements can be used on a single line. If this is the case, then a semicolon (;) can be used to separate the statements. All instructions and most commands with the exception of #v_pool, #s_pool, and #define support multiple statements per line. For lines with a single command or instruction, a semicolon can optionally be used.

Examples:

s8u myScalar; v8u myVector; v_add_i32 myVector, myVector, myScalar // this is okay
<span style="font-size: 9pt;">v_add_i32 v0, v1, v2  // this is okay
</span><span style="font-size: 9pt;">v_add_i32 v0, v1, v2; // this is okay
</span><span style="font-size: 9pt;">v_add_i32 v0, v1, v2; #define _myDef_ 12345678  // fail - #define must be on its own line</span>

Asm4GCN Project Files

The following is a list of the source files along with their description. The number of lines appears in parenthesis after the name of the file. The project has about 10,000 lines (includes comments and whitespace).

  • GcnISA.cs (1756) - This file contains all the raw data about GCN. It has dictionaries, arrays, and enums of instruction information, register aliases information, etc.
  • Encoder.cs (1404) - This file contains the static GcnParcer class. This class contains methods for each GCN encoding format whose job is to convert a single statement line into its OpCode binary format. This is the core of the GCN assembler.
  • GcnBlock.cs (679) - GcnBlock.cs contains the GcnBlock class. This class is responsible for converting a block of instructions into a byte[] binary form.
  • DataStructs.cs (50) - Contains miscellaneous random structures like GcnStmt, Define, and AsmVar.
  • Labels.cs (109) - Holds the Label and Labels class. This class is for responsible for keeping track of labels and jump distances.
  • ParseOperand.cs (346) - This static tool-like class is responsible for parsing operands. It converts hex, octal, and binary strings to constants as well as verifies that the datatype is allowed.
  • Program.cs (208) - Contains functionality for when Asm4GCN is used in command line. Most projects however will link functions like in a dll. 
  • RegPool.cs (356) - Keeps track of register usage using the RegPool class. RegPool keeps an array of currently available registers. This class is initialized by a supplied a list of allowed registers or a range starting from zero.
  • RegUsage.cs (111) - The RegUsageCalc class keeps a usage count of each register size. It also remembers the maximums with what line they occurred on. This is for information usage or for knowing how many registers an inline statement might need. (inline not implemented)
  • Log.cs (122) - Contains a logging class for handing output. It can output to a StringBuilder or directly to a console.
  • Tools.cs (76) - Contains a static extension class that has some useful extensions like IsBetween().
  • TestInput.txt (166) - This file contains examples of GCN assembly. It also doubles for testing.
  • Variables.cs (409) - Contains most classes and tasks that relate to variables. It also is the owner of the RegPool and RegUsage instances.

Point of Interest

Inline assembly – The inline assembly did not really pan out for this project but I wanted to include some notes about in case anyone wanted to give it a go. I made it a little more complicated originally (not shown here) and got burned out. The “inline” code is still there however it needs to be finished – either by someone else or myself. Here is an outline of the plan:

  1. Location and extract the inline Asm4GCN Blocks (into strings)
  2. Next, assemble the pulled Asm4GCN blocks to binary and note the byteSize, sReg and vReg counts. This is done by assembling with the “OutputUsageMode” option. In this mode, GCN assembly are compiled with temporary registers and the count of each size and type(S or V) of register is recorded. The maximum using points is important here because.
  3. In the OpenCL kernel we need to replace inline assembly with some generated dummy OpenCL code. The dummy code byte size would need to be the same size or slightly larger to fit the inline assembly bin. The dummy code would also need to use the same number of correctly sized sReg and vRegs as well as the same parameters. It also must be surrounded by barrier and mem_fence to prevent OpenCL from reordering the code or else the dummy code will be hard to identify later for replacement.
  4. Now compile the OpenCL dummy code to binary and note what sRegs and vRegs were used. Knowing what registers that were used is important because it tells us what registers we can use in the assembly. When coding the inline assembly, we will need to use variables instead of fixed register numbers since we do not know what pools of registers will be used later on.
  5. Next, re-assemble the inline assembly but this time we will prepend an allowed register pool to the top with the commands #S_POOL and #V_POOL. For example, #S_POOL s22,s23,s24,s27 would tell the assembler to use these registers for variables.
  6. The final step is to locate the dummy binary in the entire program binary and then replace it with the finished binary. RegEx can be used for this.

What was not finished? Steps (3) – This is kind of working but not as well as I hoped. It does create a kernel of close to the right size and the register counts are close also but not exact. The code for my attempts are in FillerKernelAttempts.cl: DummyFillerCode(). Step (6) – this was not finished. The start of the dynamically created DummyFillerCode binary seemed to be difficult to find.

Smart Register Packing/Reservations – The register reservation system developed for GCN4Asm is not as simple as finding the first free register. The reason is that depending on what free registers are chosen, a different amount of registers will fit. It is like fitting boxes in a moving truck. If boxes are just put in anywhere then it’s going to take up a lot of space. However, if the location of the boxes is carefully chosen then more boxes will fit.

I did some online research on how to best allocate registers but I did not have any luck finding algorithm that would directly answer my question of how to do this. I realized register reservations are much like 'rectangle packing'. Rectangle packing is the art of fitting smaller rectangles into a larger one. Using this, the small rectangles can represent the variables and the large rectangle can be the entire free register space. A registers width will be the width of one of the small rectangles and a registers lifetime is the height of the rectangle. Below is an example where each letter represents a variable's lifetime.

Visual Register Usage Example
  R0 R1 R2 R3 R4 R5 R6 R7 R8 R9
Inst 1 A A                
Inst 2 A A B   C C C C    
Inst 3 A A B   C C C C D D
Inst 4     B   C C C C D D
Inst 5     B           D D
Inst 6 E   B   F F        
Inst 7 E       F F        
Inst 8 E   G G F F        

 

 

 

 

 

 

 

So using something similar to rectangle packing, all we need to do is fit these small rectangles (registers) in a large rectangle. Again, the large rectangle represents all the registers (the width) along with the entire lifetime of the code block (the height). The goal is to fit all the small rectangles and minimize the width of the large rectangle. The width of the large rectangle is the number of registers we use in total so we want to minimize that as much as possible. The fewer registers used the better the occupancy we will achieve.

There are some differences between register packing and rectangle packing though:

  1. For the most part, the start and end of a registers life is mostly fixed and not adjustable so the rectangles cannot be moved up and down whereas in rectangle packing they can. There is an exception this however in assembly but this is not typically a task handled by assemblers. Assembly statements can often be re-ordered or moved up and down to minimize register usage - see here.
  2. The registers are typically powers of two (1,2,4,8,16...) in width whereas rectangle packing generally permits any width. Maybe there is an opportunity for additional optimization here.
  3. Finally, registers with a width of two need to be aligned by two (1st register must be even) and registers 4 or larger need to be aligned by four. There is some added code to make sure this happens.

Even though rectangle packing is not a perfect fit, it is a well-known and documented area of computer science and there are some well-known methods to find a near-optimal solution. Probably the most common and simple of these is the greedy large to small algorithm. I remember learning this one in my computer science class! This algorithm starts by inserting the largest item and then works its way to the smallest. It is pretty simple, yet effective.

For the register packing, I followed a similar system but instead of sorting the blocks by size, I fit them into the smallest spaces I could. If a rectangle fits perfectly in a free slot then it uses that, or else, it fits itself into the smallest free slot it can. The algorithm uses a scoring system to decide this.

In a nutshell, GCN4Asm is smarter than just picking the first free register space but at the same time it is not a highly optimized algorithm either.

OpenCLwithGCN Sub-Project

This project makes it possible to use Asm4GCN in C# projects. Its main function is to replace Asm4GCN kernels with dummy OpenCL kernels, and then after assembling, replace the dummy OpenCL binary with the Asm4GCN kernel binary. The OpenClwithGCN project allows the mixing of Asm4GCN and OpenCL kernels in the same cl::program. I think the best way to describe this is with a flowchart:

Features of OpenCLwithGCN

Built-in Text template engine

When programming in assembly it is almost mandatory to have some kind of mechanism to repeat text dynamically. One popular need for assembly programmers is to unroll loops. In higher level languages, compilers do this automatically but at the assembly level it is the programmer’s responsibility. 

This project uses a, C# based, text-template-transformation engine, that can be used to manipulate assembly. It is comprised of C# code tags that are run on the assembly text before being compiled – much like predefinitions. For the GCN text template mechanism, I chose to wrap the C# code in [[..]] style tags. To use it, just put in any C# code inside double brackets to control the flow of what gets displayed. Variables can be printed anywhere using [[=myVar]]. There are additional features that can be reference at http://www.codeproject.com/Articles/867045/Csharp-Based-Template-Transformation-Engine.

Text Template Engine Examples
Original Source s Expanded
[[for(int i=3; i<7; i++) {]]
  v_mov_b32 v[[=i]], v[[=i+4]] [[ } ]]
 
v_mov_b32 v3, v7
<span style="font-size: 9pt;">v_mov_b32 v4, v8
</span><span style="font-size: 9pt;">v_mov_b32 v5, v9</span>
// Created [[=DateTime.Now]]
// Created 1/24/2015 8:12 PM


Mixed OpenCL and Asm4GCN kernels in one cl::program

___kernel and __asm4GCN kernels can be mixed in the same openCL cl::program. Originally, the goal was to have OpenCL kernels with inline functionality but this did not pan out – at least not yet. So I decided to go with separate ___asm4GCN and__kernel kernels. This works out well because both types of kernels can be combined in a cl::program with steams. Note: The currently version of OpenCLwithGCN only supports a single __asm4GCN. (See limitations)

Using OpenCLwithGCN

To show how to use OpenCLwithGCN, I will walk through Example1.cs from the project.

The first step will be to add OpenCLwithGCN.exe and NOpenCL.dll as project references. Even though OpenCLwithGCN is an executable, it can still be added as a reference. After the references are added then we can start writing some code.

We start out with creating some GCN assembly in one of a couple of ways. The first way we could do this is just to start writing assembly directly. Unless the programmer knows exactly what they are doing, this is difficult because it is hard to determine what registers the parameters will use. (see pre-loaded register values) Another way we can do this is to write a simple OpenCL kernel and then decompile it. After it is decompiled into assembly, we can copy and paste the assembly into our program and modify as needed.

For this example, I will use the decompiled assembly from the myOpenClFunc(...) kernel in example1.cs (see below). The myOpenClFunc(...) in example1.cs is shown for reference only. A cl::program can a mix of normal OpenCL kernels and asm4GCN kernels. To generate the assembly, I used AMD's CodeXL.

string source = @"
    __asm4GCN myAsmFunc ( float*, float* )
    {
      #define _32Float_    0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]
      s_buffer_load_dword  s0, s[4:7], 0x04
      s_buffer_load_dword  s1, s[4:7], 0x18
      s_waitcnt            lgkmcnt(0)
      s_min_u32            s0, s0, 0x0000ffff
      s_buffer_load_dword  s4, s[8:11], 0x00
      v_mov_b32            v1, s0
      v_mul_i32_i24        v1, s12, v1 
      v_add_i32            v0, vcc, v0, v1
      v_add_i32            v0, vcc, s1, v0
      v_lshlrev_b32        v0, 2, v0
      s_load_dwordx4       s[12:15], s[2:3], 0x60 
      s_waitcnt            lgkmcnt(0)
      v_add_i32            v1, vcc, s4, v0
      tbuffer_load_format_x v1, v1, s[12:15], _32Float_
      s_buffer_load_dword  s0, s[8:11], 0x04
      s_load_dwordx4       s[4:7], s[2:3], 0x68
      s_waitcnt            lgkmcnt(0) 
      v_add_i32            v0, vcc, s0, v0
      s_waitcnt            vmcnt(0) 
      v_add_f32            v1, v1, v1
      tbuffer_store_format_x v1, v0, s[4:7], _32Float_ 
      s_endpgm 
    };

     __kernel void myOpenClFunc ( __global float* cl_input, __global float* cl_output )
     {
       size_t i = get_global_id(0);
       cl_output[i] = cl_input[i] + cl_input[i];
     }; ";

In the next step of our example, we initialize OpenClWithGCN and then compile the GCN. Also note that we grab the default environment created by OpenClWithGCN for later use.

OpenClWithGCN gprog = new OpenClWithGCN();
OpenClEnvironment env = gprog.env;
bool success = gprog.GcnCompile(source, out log);

At this point, we are pretty much done with OpenClWithGCN. In rest of this example, I will use NOpenCL to setup OpenCL variables and then execute the kernels. I prefer NOpenCL because the wrappers are almost a line for line match for the C based OpenCL. In addition, the rest of this example uses a modified OpenCL example written by Derek Gerstmann (UWA). I tried to adopt the example but I do not think I did it justice.

Next we create a Kernel from our modified Program.

Kernel kernel = env.program.CreateKernel("myAsmFunc");

Allocate the cl_input and cl_output device buffers, and fill them with random data…

Mem cl_input = env.context.CreateBuffer(MemoryFlags.ReadOnly, dataSz);
Mem cl_output = env.context.CreateBuffer(MemoryFlags.WriteOnly, dataSz);

// create some random data for testing
var random = new Random();
const int count = 1024 * 1024;
const int dataSz = count * sizeof(float);
float[] data = (from i in Enumerable.Range(0, count)
               select (float)random.NextDouble()).ToArray();

// Copy our host buffer of random values to the cl_input device buffer...
env.cmdQueue.EnqueueWriteBuffer(cl_input, true, 0, dataSz, data);

Now let us add the buffers to our kernel, and enqueue it for execution

kernel.Arguments[0].SetValue(cl_input);
kernel.Arguments[1].SetValue(cl_output);

Enqueue and run the kernel. For this kernel, we will use a work-group size of 256.

env.cmdQueue.EnqueueNDRangeKernel(kernel, count, 256);

Force command queue to get processed, wait until all commands finish

env.cmdQueue.Finish();

Read back the results

float[] results = new float[count];
env.cmdQueue.EnqueueReadBufferAndWait(cl_output, results, dataSz);

That should be it. Now just print verify and print the results.

OpenCLwithGCN Project Files

The following is a list of the source files for this sub-project along with their description. 

  • OpenClEnvironment.cs (97) - This file contains classes related to holding the OpenCL environment.
  • OpenClWithGCN.cs (615) - This file contains the core functionality.
  • TextTemplate.cs (105) - This file contains a single static Expand() function. It is responsible for all the Text Template Transformations.

Asm4GcnGUI Windows Interface Sub-Project

Asm4GcnGUI provides a graphical user interface (GUI) for working with GCN assembly. It is a little easier to use then a simple notepad because it has syntax highlighting, code completion, and a few other extras. This app can be used for some quick kernels testing, playing around, or for educational purposes. The GUI also serves as an example on how to implement OpenCLwithGCN. When launched, it defaults to a simple example that can be used as a starting point for new projects. The interface has three windows: C# host code, kernel code, and a compiler output window at the bottom. 

In the host window, you will find the C# code that is run on the CPU. The host window has everything that would make up a normal C# application so it can be dropped into a C# project. To use OpenCL in C#, a wrapper is needed. The wrapper wraps C based OpenCL dll functions in C# compatible functions. I choose to use NOpenCL for this - a well written wrapper.

The next window is the GCN Assembly code window. This is the tab next to "C# Host Code." Normally __asm4GCN kernels would be with the rest of the host code, however the GUI splits this out so syntax highlighting can be achieved for the GCN assembly. Behind the scenes, the text in the C# host and GCN Assembly windows are merged and then compiled together. The text in the GCN Assembly window is simply placed into a .CS file with namespace GCN_NS { static class Code{ public const string DevCode = [GCN assembly Text here]}}”. This allows the GCN Assembly text to be combined with the main host program code. 

The final window is the compiler output. This shows any errors, warnings or messages that were generated during the compile process. There are two sets of messages shown in here. First to show, are the Asm4GCN assembler messages followed by any C# host code errors. The output of the program does not show in this window – it shows up in a separate console window.

Features of the GUI Interface

  • Directly Runnable in Visual Studio - well almost at least! I know you probably do not want to use a tiny IDE for long so in Asm4GCN, when files are saved, they are saved in a format that can be directly opened in Visual Studio. You might be thinking, what about the GCN Assembly? Well that is wrapped in its own namespace and string constant. The only step that really needs to be done is to add references. If the file structure is not modified to much it can still be opened back up in Asm4GCN.
  • Syntax Highlighting – For this project, I used Pavel Torgashov’s Fast Colored Text box. It is a nice control that that can be customized to work with custom languages such as GCN Assembly. It provides syntax highlighting, cold folding, hotkey support, bracket highlighting, undo, printing support, and the list goes on and on. The syntax highlighting helps in readability and also with errors. If a keyword is entered incorrectly it will not highlight.
  • Code completion – There is code completion for the GCN Assembly window. This helps with trying to remember what command you might be looking for. Again, thanks to Pavel Torgashov for this add-in.
  • Separate C# and GCN Assembly Code Windows – As mentioned already, this allows for language specific syntax highlighting. It also spits out the code in a logical way: CPU host code and GPU code.

General Topics on GPU Assembly

This section is more related to GPU assembly programming in general then it is about asm4GCN.

The Good & the Bad of programming in GPU Assembly

Programming directly in GCN assembly has its advantages and disadvantages. These are my personal opinions based on my past assembly programming experience.

Advantages

  • Potential for efficient and fast code – if a programmer knows what they are doing they can often create much faster code with a smaller memory footprint. I would estimate that a skilled hand written kernel is between 2x-4x better performance.
  • Assembly can take advantage of more hardware instructions and features. Some instructions are only accessible using assembly. In addition, some instruction options can only be utilized by using assembly. Finally, the GPU processor also has features, like special registers, that can only be accessed using assembly.
  • Toying with GPU assembly helps programmers understanding the inner workings of a GPU. This will help when writing Cuda or OpenCL kernels.
  • When programming in assembly, a human programmer can often stuff items in memory more efficiently than a high-level compiler. Humans are more creative than a compiler. =)

Disadvantages

  • Low-level languages, like GCN Assembly or nVidia’s PTX, take much longer to program in then higher-level languages such as OpenCL and Cuda.
  • Difficult to maintain – following an assembly function is not as easy as higher-level languages. Understanding what to change without breaking something can be time consuming. Usually small 5 to 20 liners do not have this issue as much.
  • Low readability – Assembly by nature is not meant to help with high-level readability. The only readability it provides is allowing binary code to be human readable. Assembly has little refactoring and abstraction and this makes it difficult to read and follow.
  • GPU Assembly can break with new generations of GPUs. Since it is designed for a particular chip, it can break with future generations of GPUs.
  • Depending on how GPU assembly binaries are loaded, they can break with driver updates. The OpenCLwithGCN sub-project is more sensitive to this because it needs to find a dummy binary so it can patch it with the GCN as
  • OpenCL kernels work across different architectures including AMD GCN, nVidia, x86 and x64. Assembly kernels are locked into the AMD GCN 1.x architecture. Luckily, AMD GCN architectures have deep development cycles so GCN is not constantly changing.
  • Assembly is more bug prone - Higher level languages are designed to minimize common programming mistakes.
  • When programing in assembly for performance, it can take some effort to win-out against a high-level compiler. Just because it is written in assembly, does not mean it is going to faster.

In general, based on personal experience, assembly is optimal for smaller time-critical sections of code. Large pieces of assembly can get complicated quickly if they are not well laid out. 1-50 lines of assembly is fine for critical sections but as a kernel gets more complicated, the compiler starts to have the upper hand because it can keep track of things better.

Preload Register Values

When a GCN kernel is lunched, some of the registers have pre-loaded values. For new users, this is a necessary list. Since the driver determines some of these it is possible that they can change in future driver releases.

Reg Name
s[2:3] UAV Table Pointer
s[2:3] +0x60 base_resource_const1(#T)
s[2:3] +0x68 base_resource_const2(#T)
s[4:7] Imm Const Buffer 0
s[4:7] +0x00 Grid Size
s[4:7] +0x04 Local Size
s[4:7] +0x18 Base Global ID
s[8:11] Imm Const Buffer 1
s[8:11]+0x00 param1 offset 
s[8:11]+0x04 param2 offset
s[8:11]+0x08 param3 offset
s12 Group ID
v0 Local ID

GCN Assembly Code Writing Tips

  • Plan Ahead – Maybe write your code in a higher-level language first - it is almost like pseudo-code. Writing a function in a higher-level language accomplishes a few items:
    • It helps work out the details and helps you understand what you are writing fully. For me, sometimes I think I know what is needed only to discover that I did not fully understand the problem. Or, I come to discover that the function will not even work. Figuring this out in assembly is not a good idea.
    • Coding your GPU function in OpenCL/Cuda first provides an outline to work from. 
    • It gets the mind mentally ready. I often find having to write a function a second time leads to a cleaner and more precise function.
  • Know the GCN ISA Manual - Go through the AMD GCN ISA programming manual to see what instructions are available that might aid in how things are done for your kernel.
  • Unroll Your Code to Prevent Jumps – jumps take extra compute cycles that can often be avoided.
  • Limit Kernel Size to Fit in Cache - Try to keep your kernels small enough to fit in the compute units share instruction cache, currently 32kb. (4000-8000 instructions) I learned this one from Realhet.
  • Limit Register Usage – The less registers that are used the more latency-hiding threads can run. One trick I have figured out to save on register usage is to move instructions up or down in your code to lower the overall register usage. Example: say on line 10, A and B are assigned values, then on line 20 they are first used with a “C=A+B“, and finally C is used on line 30. The register usage would be 2 variables from lines 10-20, and then 1 variable from lines 20-30. The total register usage can be written as (2 var * (20-10)) + (1 var * (30-20)) = 30. To save on registers we could move the “C=A+B” up to line 11. This minimizes the time two registers are used. Then our usage would become (2 var * (11-10)) + (1 var * (30-11)) = 21. Basically we just freed up a single register from lines 11-20. This trick only applies in assembly however because OpenCL and Cuda are smart enough to do this for you.
  • Just play and have fun with GCN assembly. The more you practice the better you will be.
  • Read online blogs, posts, and articles(like this one) =)

Limitations

  • Does not support 3rd Generation(Volcanic Islands) GPUs. [Radeon R9 280, Fury, & Nano]
  • Does not work with some driver versions (see system requirements)
  • Only compatible with AMD display adapters with GCN technology (see system requirements)

Future Wish List

  • Correct the issue where the same variable name cannot be reused.
  • GCN Generation 3 support
  • Additional OpenCL 2.0 support
  • Add compatibility for additional GPU driver versions.
  • Inline Assembly – This was an original goal but generating a dummy kernel for it was problematic.
  • Alterative friendly, easy to read, assembly statements. Instead of something like v_mul_i32_i24 varA, varB, varC maybe have something like varA = VarB * VarC. To do this we would need to know the types of the variable to select the correct instruction. This part is already in place.
  • Use variable type information (I,U,F,B) to throw warnings when used with non-matching instructions. i.e. using v4f with v_mul_i32_i24

Videos

I have uploaded at least one tutorial video. Please note that some parts of the video(s) may be out of date. It is not as easy to update a video like it is a document.

https://www.youtube.com/results?search_query=asm4gcn

System Requirements

  • An AMD Graphics card with GCN 1.0, and 1.1(Gen2) technology. 3rd Generation (Volcanic Islands) GPUs such as Radeon R9 280, Fury, & Nano are not supported.
  • AMD display driver versions 13.251, 14.501, and 15.200, 15.201, 15.300 & 16.150 are supported. Other versions may not work. 

History

  • Feb 16 2015 - Initial Public Release
  • Mar 1 2015 - General Fixes and Changes
    • Fix: Variables would always use register 0
    • Fix: Removed single __asm4GCN block limitation - There can now be many __asm4GCN kernels in one program.
    • Change: Parameter names removed - Since the Parameter names are not used having them there could be confusing. Function headers are now in the form: __asm4GCN myAsmAddFunc (float*,float*){...}
    • Change: merged #ref command into normal variable declarations. Since the #ref command is almost identical to normal variable declarations except that it specifies a register it is best to combine these. It is cleaner and less confusing. Instead of the format for a ref being #ref s8u myVar s[2:3] is now just s8u myVar s[2:3].
    • Improved: enlarged the autocomplete box - it now fits the code a little better.
    • Improved: Cleaned up example code.
    • Improved: Syntax Highlighting - it now highlights, labels, registers, and defines. It also highlights matching words.
    • Removed: Auto compile skip function. This function would skip a re-compile if there were no changes in the code windows. It was removed because it added complexity in the code and there was hardly any performance benefit since the compile process is so fast anyway. 
    • Added: ren command - A rename command has been added. This allows a variable to be renamed as its use changes.
    • Updated: Updated this article
  • April 22 2015 - Posted on GitHub
  • July 18 2015 - Fixed autocomplete and branches
    • Change: Switched the OpenCL wrapper to use NOpenCL by Tunnel Vision Laboratories. This is an awesome well-written wrapper.
    • Added: Indexing on variables. (i.e. myVar[1] would access the 2nd register in myVar)
    • Added: VINTRP encoded instructions
    • Fix: Fixed bug in SAPP encoding. Jumps were not working properly.
    • Fix: Auto-complete was not working on the GCN tab - it has been fixed..
    • Updated: Updated this article
  • August 2 2015 - worked on Variables
    • New: Variables automatically free themselves on the last line they were used.
    • New: Jumps can now be used in front of any statement and not just by line.
    • Refactor: Added variables class and moved variable functionality into variable class
    • Refactor: reworked how jump functionality
    • Refactor: everything used to be done in one-pass and now it is in two-passes
      • 1) Pass 1 - read in all statements and record positions of vars
      • 2) Process automatic variable freeing
      • 3) Process register assignments for variables
      • 4) Pass 2 - convert statements to bin
    • Removed: Removed the 'ren' function as it resulted in ugly code. The same can be accomplished by declaring vars with specified registers.
  • August 9 2015 - General Updates
    • New: Inline variable decorations (e.g. v_mov_b32 v4u myNewVar, anyVar )
    • New: multiple labels intermixed with statements can now be added to a single line
    • New: Freed Variable registers can now be re-used in the same instruction.
    • New: Lines ending with '/' will append the following line. #defines and statements can be split across lines.
    • New: Cleaned up initial code and added a #define(...) to easily view any S or V variable.
    • New: Added Ctrl-Y as a redo operator for Visual Studio users.
    • New: Additional variable warning checking (like when it is never used or used only once)
    • New: When declaring variables an existing variable, with an option index, as the register you want \
      to re-use.
    • Change: Spaces can no longer be used to separate operands. Only commas can be used. This does not apply to trailing instruction option parameters.
    • Change: Defines are now processed before labels.
    • Change: Defines are now processed in reverse order so defines replacements can contain previous #define replacements. 
    • Fix: multi-register variables were not always aligned properly.
    • Fix: Fixed issue with variable indexes 
    • Fix: syntax color highlighting issue on GCN tab where not all text was always highlight properly.
    • Fix: Did some minor fixes and adjustments in the instruction encoder.
    • New: Added unit tests that double as examples.
    • New: Added example: A fast wavefront sum reduction using 18 instructions and no shared memory.
  • Nov 2015

    • New: Added a friendly syntax feature so that asm statements can be entered in a more readable format. Example: "v_add_i32 localSizeIdx, vcc, laneId, vLocalSize" can now be added like "localSizeIdx = laneId + vLocalSize". It is easier to read. It is only supported on +,-,*,>>,<<.
    • Refactor: Pulled out some of the Regular Expressions to their own file. This will be expanded on in the future.
  • Jan - April 2016 
    • Update; Added prompt for kernel export.
    • Update: Added driver versions 15.300 and 16.150 to acceptable list; code cleanup.
    • Fix: Asm4GCN would fail if there was more than one AMD GCN GPU. It will now work but Asm4GCN will not work with both GPUs without adjustment
    • Update: Added a support for a newer driver version; add a "vcc" for the friendly converter for "+"
    • Fix: Corrected opcode for DS_WRITE_SRC2_B32=141 and DS_WRITE_SRC2_B64=205 (Thank you to Mateusz Szpakowski for finding that and correcting it.)

Other GCN Assemblers

cmingcnasm - cmingcnasm is a GCN minimal assembler written in c for GCC on a Linux platform. It was created by Sylvain Bertrand. I have not played with this library but I would imagine that is fast since it is written in c. Links:  GitHub,  GoogleCode

gcnasm - Created by Daniel Bali, is an open source GCN assembler that is written in c. The code is written efficiently using low-level c so it is lighting fast. It uses the GCC compiler in Linux but can easily be modified to work in a windows environment because of its standard c includes. Many of the ideas and code references on how to build Asm4GCN came from gcnasm. Links:  GitHub,  OpenWall

HetPas Assembler (windows)- Written by Realhet, HetPas has been the only directly runnable GCN assembler for windows up to this point. I have used HetPas’s assembler for a couple of years now. It’s good for creating binary kernels or just playing around with GCN Assembly. The program implements a full and feature rich GCN Assembler using a Pascal like language for host code. It does require "Data Execution Prevention" to be turned off in windows and there are instructions on how to do this. HetPas has been updated several times and is getting more and more feature rich with each release. Recently variables were added as well. 

A Special Thanks To....

AMD for making their GCN ISA manuals available. I spend many hours combing through the ISA documentation and referencing their data tables. 

Daniel Bali, who built an excellent open source GCN assembler. This was my first compiler so I was looking for ideas on how to start. Daniel’s project gave me ideas on how I could tackle this. I took away some important concepts on how to build an assembler from his project.

Derek Gerstmann for his easy to follow and complete OpenCL example. Derek's OpenCL example is the default example that opens in Asm4GcnGUI. It has been converted to C# for NOpenCL.

Pavel Torgashov for the FastColoredTextBox editor and the Autocomplete menu. These controls add syntax highlighting and code completion richness to the GUI interface. 

Realhet, who built a fully functional and feature rich assembler for windows called HetPas. Much of my GCN assembly skills came from toying with Realhet’s assembler and from reading his posts. I have been a Realhet fan for a while. He has made many insightful posts on AMD forums and on his WordPress site.

Tunnel Vision Laboratories for providing the very slick OpenCL wrapper called NOpenCL. It was written by Sam Harwell.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here