|
Processors must be backward-compatible, so each processor supports the extensions of the previous ones.
Here are some SSE-related links I've found (the Intel link describes their C/C++ compiler):
http://www.tommesani.com/Docs.html
http://www.codeproject.com/KB/recipes/BubbleSortWithSSE2.aspx
http://softwarecommunity.intel.com/isn/downloads/softwareproducts/pdfs/347602.pdf
http://www.hayestechnologies.com/en/techsimd.htm#SSE2
|
|
|
|
|
Thanks Alan for your answer.
I also did some research on the internet in the meantime, so I can answer my own question:
yes, the Intel C++ compiler supports SSE instructions. It can compile some code directly into SSE instructions, parallelizing it in a process called autovectorization.
For those that might be interested, follow the link "Product in depth" in the page :
[^]
Microsoft Visual Studio 2008 does not provide that.
Jean-Marie
|
|
|
|
|
epitalon wrote: Microsoft Visual Studio 2008 does not provide that.
Rather than ask the compiler to guess what type of calculation you're performing and hope it is able to optimize it for you, programmers should just stop being lazy and do it properly: align your variables/structures to 16-byte boundaries, parallelize your algorithm (fine-tune it) by hand, and then use compiler intrinsics (supported by the MS, Intel and GNU compilers). Better yet, use something like SSEPlus so that your app chooses the most appropriate instruction set for the CPU it's running on. That way you can be sure you're getting the best performance for speed-sensitive calculations.
link[^]
|
|
|
|
|
Thanks Mike for your answer.
I understand that I don't need the compiler to decide when to parallelize for me, but I don't know what syntax I should use to write parallel code and compile it into SSE instructions.
By the way, I have Visual Studio 2005. Is the feature you are talking about available in VS2005, or should I upgrade to VS2008?
What are the "compiler intrinsics" you are talking about?
Regards,
Jean-Marie
|
|
|
|
|
epitalon wrote: What are the "compiler intrinsics" you are talking about ?
That wasn't me, that was a quote from the discussion I provided the link to. You should follow the links that people provide in replies.
Here is another[^]
|
|
|
|
|
Can you explain to me how to calculate the offset in row-major and column-major order when there are 3 or 4 dimensions involved?
|
|
|
|
|
Hi,
I can't explain it any better than this[^].
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
|
|
|
|
|
The formula given on that page doesn't work right, especially if each item takes up more than one byte or if there are more than two dimensions. My book does it a totally different way. I just wish there was some kind of chart I could check to make sure my program was calculating the row-major/column-major order correctly. Right now I think I have it working, but I'm not sure my logic is correct; I need something to check it against!
|
|
|
|
|
OK, for an array with dimensions (d0, d1, ... d8, d9) the offset of element (e0, e1, ... e8, e9) is either
e9 + d9*( e8 + d8*( .... + d2*( e1 + d1 * e0)...)) which does not use d0
or
e0 + d0*( e1 + d1*( .... + d7*( e8 + d8 * e9)...)) which does not use d9
depending on the scheme you want to use
AFAIK that corresponds to the formulae in Wikipedia.
Warning for VB users: the dimensions indicate the number of elements in a rank, not the maximum index
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
|
|
|
|
|
I have a MASM routine that generates assembly code to implement a Quicksort. All is working, including support for embedded keys in records, support for signed/unsigned, big-endian/little endian, keys of BYTES/WORDS/DWORDS/FLOATS/DOUBLES/EXTENDED_FLOATING, null terminated strings for BYTES or WORDS (element count=0), multiple elements for any data type (counts from 1-2048), multiple data types for a secondary key including its use as a tie-breaker for equal primary keys to make the QuickSort Stable. I will add support for MMX and XMM keys, but the current quest is to add the capability to declare Ascending or Descending in any mixture for either the primary key or for the secondary key (one can be ascending, the other descending, or both the same - either ascending or descending). I could just invert the conditional transfers and keep the comparison order the same, or invert the comparison order and keep the conditional transfers the same, or finally, just swap the indexes at entry and restore them at exit (XCHG) and keep all of the rest of the extensive code the same.
Note, the source code is extensive, but the macro invocation conditionally assembles only what is required and there is not a lot of checking what should be done next. Inserting the XCHG's for each comparison would add code (small) and execution cycles (small), but the compare logic is the highest use code of the whole sort. Conditionally assembling a version for ascending order and a different version for descending order is basically a copy and paste with an exchange of either the indices (and has no extra code or cycles), or a change of the conditional transfers. It seems that maintaining the two copies in parallel might be dangerous - fixes to one set of code would have to always be applied to the other set, and we all know how often that strategy fails - duplicate versions aren't!
So the question is, should I swap the indices and which way should I do this - swap the indices or swap the comparison order, or should I create two versions, one for ascending and the other for descending?
Please feel free to jump in with both feet and kick up a little dust, but this is a Quicksort so I don't want to hear about C++ or C# and overloaded operators with all of the bloat that goes along with HLL.
Hmm, I just thought of a way to allow both the primary key and the secondary key to be null-terminated strings. The key at the lower offset has to end at the higher offset, and the key at the higher offset has to end at the end of the record. More goodies to add, but carefully, since the secondary key logic was a copy of the primary key logic with different parameters, and I already stripped out the support for a null-terminated string (more than one line). Remember that bit about duplicate versions that aren't? Oh, well!
Dave.
|
|
|
|
|
Hi,
Wow. Quite a question.
Not sure why you are doing all this in assembly, but whatever the language, this would be my approach:
for each comparison, calculate a difference which is negative if object1 < object2, zero if they are equal, and positive if object1 > object2;
then multiply that difference by a factor which is +1 for ascending and -1 for descending. The result indicates precede (negative), don't know (zero), or follow (positive).
When you have multiple sort criteria, handle the primary one first; if that results in "don't know", continue with the next criterion.
[EDITED because of HTML eater]
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
modified on Tuesday, April 7, 2009 9:14 PM
|
|
|
|
|
Luc,
Thank you for the reply.
And I was worried about adding two XCHGs, and you want me to put in a MUL? OMG!
Actually, you can get the same effect for descending by just changing the ordering of the compare arguments, i.e. compare object2 to object1. This is exactly what C++ does for an IF, it keeps your comparison code (i.e. >=, generates a jae or jge depending on sign) but reverses the compared values and skips around the body of the IF if it is not >=. This, however, leaves the code much different than the code for an ascending sort, somewhat harder to maintain and enhance. That was why I was thinking about XCHG'ing the indices and keeping the same code for both sort types.
Another difficulty with that approach is that floating compares do not affect the usual flags, only the flags in the FPU. You have to extract the FPU flags and make special compares to effect a ja or a jb, and the tests are of a TEST nature with a je or jne, not a ja or jb. I have all of this working, and I am now testing the tie-breaker/secondary_key code.
Dave Augustine.
|
|
|
|
|
If you don't want to multiply by +1/-1 you can conditionally negate the difference; that's bound to be cheaper than two conditional XCHGs. However, a single MUL is probably the fastest on any Pentium or better, thanks to separate integer units and no unpredictable change of program flow.
An alternative is self-modifying code where you execute either a NOP or a NEG on diff.
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
|
|
|
|
|
Luc,
Thank you again. When I get done, I will post (if not too big) or send you some zips of an ascending and a descending DWORD sort and the same for float - just the actual quick sort code.
Dave.
|
|
|
|
|
Luc,
I have completed development on my Quicksort. The following is the timing for the sort using just an array of DWORDS. The rest of the tests work as planned: ascending/descending, signed/unsigned, big-endian/little-endian, two different sort keys allowed, BYTES/WORDS/DWORDS/Floats/Doubles/extended_floating, BYTE or WORD null-terminated strings, up to 2048 elements per key, and a fully automated test and tracing environment to verify results.
Do you have a feel for what a good timing benchmark might be for this size of a sort?
The following timings are only for the actual sorting, not the setup. The data
is prepared, start time taken, sorted, end time taken, time difference is
accumulated, loop back 1023 more times.
The random data originally was created by randomly selecting a number from the
set, saving the number, shrinking the set, and recalculating a new random number
on a decremented range. This was taking too long to shrink up a 1024*1024 DWORD
buffer for each extraction. The method was changed to just create a random
number in the range of 0 to ((1024*1024)-1), save the random number, decrement
the range, and calculate another random number for the decremented range. Since
there are potentially many duplicates, and probably not all numbers are present,
the checking was changed from comparing the sorted array against a known
ascending or descending array, to individually comparing each DWORD pair
(comparing element n to element n+1) and verifying that it is not above (if an
ascending sort) or not below (if a descending sort).
Note: The sort completed with only a 4096 byte stack, sorting 1024*1024 DWORDS.
Building Ascending data.
Building Descending data.
1024*1024 Ascending DWORDS sorted 1024 times Ascending, time in msec: 37144, good compare. 36.27 msec/iteration.
1024*1024 Ascending DWORDS sorted 1024 times Descending, time in msec: 46130, good compare. 45.05 msec/iteration.
1024*1024 Descending DWORDS sorted 1024 times Ascending, time in msec: 47689, good compare. 45.57 msec/iteration.
1024*1024 Descending DWORDS sorted 1024 times Descending, time in msec: 53870, good compare. 52.61 msec/iteration.
Building Random data.
1024*1024 Random DWORDS sorted 1024 times Ascending, time in msec: 158389, good compare. 154.67 msec/iteration.
1024*1024 Random DWORDS sorted 1024 times Descending, time in msec: 141898, good compare. 138.57 msec/iteration.
I do have code samples (.LST output) if you are interested.
Dave.
|
|
|
|
|
I don't fully understand your test cases; however, here are two questions you can ask yourself:
1. why is there a big difference between "Ascending DWORDS sorted 1024 times Ascending" and "Descending DWORDS sorted 1024 times Descending"?
2. how does the timing compare with a regular quicksort as is offered in some libraries?
|
|
|
|
|
Luc,
What's with the membership change? After 5 years and 1,000,000 posts do they take away your Gold Membership? You must have been a very naughty boy.
The timing differences, I believe, are just normal Windows getting in the way of everything. I mean, my development machine is not connected to the internet, it is only doing one thing (executing my program), so why is there all this activity as indicated by the Task Manager? Paging with 4 million bytes involved probably plays a part in it also.
I will execute the two tests again with tracing enabled and pick some shorter tests (my standard testing is the character set A-Z for easy visual checking of results). The tracing will report the number of compares and exchanges, as well as report all of the exchanges as they occur. My algo starts by swapping the middle element for the right end (the pivot) and then finally determines that there is nothing to do (because it is already in order) so it just re-swaps the middle and end, but then it sorts the front half and then sorts the back half, both of which are in order so the same thing happens again twice, then 4 times... If the sort has to reverse the order (ascending data sorted descending), then after the first swap to put the middle at the end, it will end up swapping end elements, working to the middle (since everything is on the wrong side of the pivot), and ends up with a re-swap of the middle and the end.
Dave.
|
|
|
|
|
Hi Dave,
Don't worry, I'm using a new account to avoid all the crap gold members get to digest on the home page.
And it's not 1 million posts, even CG does not reach 100K!
If all numbers are ascending and that is what you don't want, everything gets swapped.
If all numbers are descending and that is what you don't want, everything gets swapped.
I expect these two test cases to have the same speed. I understood they did.
If all numbers are ascending and that is what you want, no swaps occur.
If all numbers are descending and that is what you want, no swaps occur.
I expect these two test cases to have the same speed. I understood they did not, they were quite different, and by much more than some Windows unpredictability could explain.
Member 4194593 wrote: all this activity as indicated by the Task Manager? Paging with 4 million bytes involved probably plays a part in it
Huh. What is your hardware? How could 4MB be cause for paging?
If this is a miniature PC, try the same code on a real one!
|
|
|
|
|
Luc,
I found an error in the Descending Descending where I had supplied the ascending
data instead of the descending data, and I corrected that. I also modified the
timing procedure to create all of the data first, then execute the tests, and
then added a loop to run all of the tests 10 times. I added code to call
SetPriorityClass to set the priority to HIGH_PRIORITY_CLASS. I added code to
put the start of each procedure on a 16-byte-aligned location to make sure that the
same code in the ascending and descending routines stayed on the same relative
boundaries to avoid differences in the hardware branch predictions - this would
only cause a 1 or 2 cycle difference in time, but with 1,000,000,000 *
1,000,000,000 compares it will make a difference - note that something as small
as a 26 record sort takes 109 compares and 14 exchanges.
I put the loops for the left scan and the right scan on 16 BYTE boundaries to
aid in the hardware branch prediction.
I modified the code to clean up the floating jump on condition tests (had to
re-order my register usages to free up eax, then debug and re-test).
I modified the test code to add a timing test for floats.
The following are the results from the first two iterations of the tests (almost
the same times as for DWORDS):
Building Ascending data.
Building Descending data.
Building Random data.
Floating tests DWORD tests
1024*1024 Ascending Floats sorted 1024 times Ascending, time in msec: 32359, good compare. 32068, good compare.
1024*1024 Ascending Floats sorted 1024 times Descending, time in msec: 36407, good compare. 37310, good compare.
1024*1024 Descending Floats sorted 1024 times Ascending, time in msec: 36749, good compare. 37276, good compare.
1024*1024 Descending Floats sorted 1024 times Descending, time in msec: 36674, good compare. 38611, good compare.
1024*1024 Random Floats sorted 1024 times Ascending, time in msec: 130539, good compare. 132599, good compare.
1024*1024 Random Floats sorted 1024 times Descending, time in msec: 131883, good compare. 131187, good compare.
1024*1024 Ascending Floats sorted 1024 times Ascending, time in msec: 32005, good compare. 32005, good compare.
1024*1024 Ascending Floats sorted 1024 times Descending, time in msec: 37254, good compare. 36943, good compare.
1024*1024 Descending Floats sorted 1024 times Ascending, time in msec: 37126, good compare. 35773, good compare.
1024*1024 Descending Floats sorted 1024 times Descending, time in msec: 37446, good compare. 37359, good compare.
1024*1024 Random Floats sorted 1024 times Ascending, time in msec: 130639, good compare. 133531, good compare.
1024*1024 Random Floats sorted 1024 times Descending, time in msec: 130431, good compare. 131447, good compare.
Dave.
|
|
|
|
|
OK, this looks good. The timing anomaly has disappeared. Well done.
|
|
|
|
|
Luc,
It was an interesting search. The only "mistake" was the use of ascending data instead of descending data for the test. The rest of the changes were to avoid the branch prediction problems - keeping all of the tight loops on a 16 byte boundary and keeping the loops down to 16 bytes or less.
Dave.
|
|
|
|
|
Hi all
I would be very happy if anyone could provide me with an algorithm to rotate a bitmap around its center. The problem is that it has to run on .NET Micro v3.0 (otherwise I would have just used that very nice Graphics that exists in the full version of .NET), in which they removed the Bitmap.RotateBlt method. Speed is not a (major) concern.
|
|
|
|
|
Hi,
Without a method that offers free rotation, you would have to do it yourself, along these lines:
foreach row y2 in destination image {
  foreach column x2 in destination image {
    calculate corresponding location (x1,y1) in source image
    copy from source image at (x1,y1) to destination image (x2,y2)
  }
}
Without translation, the rotation formulas would be something like
x1 = x2 * cosa - y2 * sina
y1 = x2 * sina + y2 * cosa
[EDIT: fixed a typo /]
where cosa and sina are the (constant) cosine and sine of the rotation angle.
You would need to add some constant to x1 and y1 to choose the center of rotation.
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
modified on Monday, April 6, 2009 11:25 AM
|
|
|
|
|
Hi, thanks for the reply.
I'm having trouble making the center of rotation the center of the image. If I choose the constant added to x1 and y1 to be 32 (I'm using 64x64 pixel bitmaps), part of the source bitmap is not copied.
What am I doing wrong?
Thanks a bunch in advance.
|
|
|
|
|
Hi,
If you want to rotate an image around a point C that is at (CX1,CY1) in the source image and at (CX2,CY2) in the destination image, then obviously you have to translate the coordinate system in both images, like so:
(x1-CX1) = (x2-CX2) * cosa - (y2-CY2) * sina
(y1-CY1) = (x2-CX2) * sina + (y2-CY2) * cosa
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
|
|
|
|
|