SPO600.asm

Tuesday, 31 March 2015

Switching projects

Project Switch

Project Progress: ~~PERL5~~ PCRE / Stack-less Just in time compiler

This is an update about my project progress in SPO600, I have switched projects and made some steps forward.

Perl5

In my previous post I ruled out some areas in Perl5 and was heading towards the regular expression engine as an area to optimize. Well I got a response from the mailing list saying the regular expression engine should be fine and does not really need any simple optimizations. I was kind of stubborn in thinking I could find something in perl and I kept looking when nothing obvious was there. In hindsight I should have stopped and tried out different packages sooner.

PCRE and sljit

After a suggestion from my professor I have looked at PCRE(Perl compatible regular expressions) This small library deals with parsing regular expressions using similar semantics as perl. I downloaded the source via subversion

svn co svn://vcs.exim.org/pcre2/code/trunk pcre

and found a folder in the source called sljit. There are many files in that directory with architecture specific code. They have some for x86, some for ARM, some for Sparc and more, this peaked my interest so I looked into it. I found that sljit is a stack-less just in time compiler which is cpu independent, more info here. Specifically in pcre it is part of a pcre performance project which uses sljit to improve the pattern matching speed of pcre.

Moving forward

I have just found all this recently, so for now I will be looking into any functions that can be ported over to another architecture or any areas that look promising for optimization. I hope to soon zero in on one particular location, and actually get something done.

Thanks for reading

Tuesday, 24 March 2015

Project updates

Project Progress

Project Progress: PERL5

This is an update about my project progress in SPO600

Ruled out

In my previous post I mentioned three areas, The tail call optimization, Regex super - linear cache and inline assembly. After contacting the perl5 community on irc it seems that the tail call optimization portion should not be there, therefore it has been ruled out. I mailed the mailing list about any suggestions regarding the inline assembly portion and any other ideas they had but I have not received word back as of now. I kind of ruled out this assembly code for now unless further information comes up to suggest it has potential.

Moving on

In proceeding with this project I have not narrowed down or progressed as much as I would have liked to by this time, I am just spinning my wheels. For now I will work at understanding how I could improve the regex engine while also exploring other packages to see if I can find some more straightforward things to accomplish. Ideally I would like to find places I could use inline assembly optimizations or compiler intrinsics, both of which we have been looking at recently in the SPO600 course.

Sunday, 15 March 2015

Project Progress

Project Progress: PERL5

This post is about a project I am starting in my SPO600 class that requires me to optimize a portion of the lamp stack.

Areas to optimize

I have chosen Perl5 as the package that I will be working on and have found a few areas that may be good areas to optimize and make changes.
The first two areas I found using the perl todo list.
The first was tail call optimization:
Seen at line 1130.
This would essentially have me find areas where tail call optimization is possible and rewrite them to implement it. Here is a link that explains what tail call optimization is: TCO

The second area I found was in regards to Perls regular expression engine.
Seen at line 1154
In their engine certain regular expressions end up taking exponential time. They have a workaround for this called super-linear cache but they say the code has not been well maintained and could use improvement. I found the location of this problem in the source by grepping for the keyword 'super-linear', found at regexec.c.
It seems like this could be an area for optimization although I am not very confident about how I would attempt this or what I would change to improve it because I do not have a strong knowledge of how a regular expression engine works.

The final one I found by looking through the perl 5 git repository(Instructions here), I found some sections of inline assembly by grepping for the keyword 'asm' using 'grep -r asm ./*' these sections were in a file called os2.c:
my_emx_init()
my_os_version
These functions may potentially be able to be ported to aarch64 syntax. I am a bit uncertain about this area because I am not sure what this code does or if it is important or not.

Why Perl?

I chose Perl for my project because the community seems really clear and organized. They have a todo list with various tasks which is very useful and as you can see above it helped me a lot with regards to finding areas to work on. Also they have a very active community, In their mailing list archive, there are daily messages which makes me confident that if I need help or need to ask a question I won't be waiting for extended periods of time.

Proceeding

Looking at my 3 options I believe the tail call optimizations might have a large impact depending on how many areas I can find. I would like to implement some code involving the aarch64 platform because that would relate to the SPO600 course the most but I am uncertain about the inline assembly code that I have found so far. The regular expression area seems really interesting, but I am afraid it would not be feasible given the time I have, it is something I will definitely consider if my project doesn't go as planned.
Proceeding with this project I plan on starting to work out how to apply the tail call optimizations while I engage with the upstream community about which direction is the best for them and for me. I also plan on benchmarking Perl on x86 and aarch64 to see if I can find any further areas or functions that may let me perform a platform specific optimization.

Perl Upstream

Perl has a relatively straightforward guide on their website here.
To summarize, if you have a patch either use perlbug or send it to perlbug@perl.org. Once the patch has been processed it will be posted on the mailing list for discussion, you are encouraged to join the discussion and promote your patch. They recommend using git, You can get the source by using 'git clone git://perl5.git.perl.org/perl.git perl', Once you make changes you can use git diff to make a patch, this compares your branch and the main branch to produce the patch.

Conclusions

This project has made me the most nervous of any project I have had so far. It is filled with uncertainties, a couple weeks ago I was uncertain I would even find anything to work on but eventually I did. Now I am uncertain on which direction to go and whether or not my contributions will be accepted. Regardless of what happens it is a great learning experience and I now appreciate the complexity of large projects like Perl or other packages in the lamp stack.

Monday, 2 March 2015

Device Access

This post is regarding a presentation that I did in SPO600 where I talked about device access in low level languages.

Access using I/O Ports

I/O ports: These look just a memory cell to a computer but they connect any data written or read from it to a device that is connected to the computer.

  MOV DX, 0487 ; Port number.
  MOV AL, 1    ; Number to write to the port.
  OUT DX, AL   ; Write to the port.
  IN DX, AL    ; Read from the port.

In this case DX takes in the port number and AL takes in the number you are writing to the port. OUT will write the number stored in AL to DX and IN will read from the device and store results in AL.
Danger:
Accessing devices in this way is dangerous. Every port cannot be accessed in the same way, Some ports are read only, and some are write only and some are both. If you don’t know exactly what port you are using you risk damaging any device that is accessing that port.

Access using Interrupts

An alternative method is using software interrupts to gain access to a device, This is often used in more complex devices such as the mouse in order to simplify accessing it.

  MOV AX, 0 ; Access subfunction 0 of int 33h 
  INT 33    ; Make the interrupt call

In the Example we use Interrupt 33 which provides access the mouse, and moving a number into AX will allow you to access a specific function of the mouse (function 0 returns a value which indicates if a mouse has been detected/installed). There are many, many more sub-functions available here that allow you to access many different parts of the mouse.

Platform / architecture Issues

I/O and device instructions are very processor dependent, due to the details of how a processor moves data in and out, most of the source code is coded specifically for a platform. For example in the linux kernel they have many different files including io.h that are stored in folders specific to the architecture.
Arm: io.h
x86: io.h

Conclusions

I was unable to find to much information about this topic and Chris Tyler, my professor for SPO600, pointed out to me that this is because all the device access SHOULD be handled by the operating system unless you are writing for device drivers or doing embedded programming.

Resources

1 A portion of a book explaining IO ports and interrupts
2 A tutorial explaining IO ports and software interrupts
3 One chapter of a book about device driver programming detailing IO ports and how they are used with devices

Thanks for reading

Saturday, 28 February 2015

Compiled C lab

Compiled C Lab

This post explores some of the gcc compiler options and how they affect the assembly code that gcc creates.

Files Used

Further in this post I will be referencing some simple files, here is the original c source:

//hello.c
#include <stdio.h>

int main(){
  printf("Hello Everybody\n");
}

//hello2.c
#include <stdio.h>

 int main(){
   printf("Hello Everybody , %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n",1,2,3,4,5,6,7,8,9,10);
 }

//hello3.c
#include <stdio.h>

 void output(){
   printf("Hello Everybody , %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n",1,2,3,4,5,6,7,8,9,10);
 }
 int main(){
   output();
 }

Objdump

I will be using these three commands with objdump in order to analyze the gcc output file.

objdump -f [filename]
This shows you header information, such as the architecture it was compiled for, file name and more.
objdump -s [filename]
This shows you detailed section information.
objdump --source [filename]
This disassembles your source code and shows you it, alongside the assembly instructions generated.

Compiling and analyzing

We will be compiling the source code with a number of different options and analyzing it with the objdump options shown above.

First compile: gcc hello.c -g -O3 -fno-builtin -o origin

Using objdump -f gives us header information. This information changes minimally within one computer, but as soon as you start moving to different computers and different architectures it will change a lot. This is the output:

 origin:     file format elf64-x86-64
 architecture: i386:x86-64, flags 0x00000112:
 EXEC_P, HAS_SYMS, D_PAGED
 start address 0x0000000000400400

Using objdump -s gives us valuable section contents. For example the "Hello Everybody" string is stored in the section .rodata.

 Contents of section .rodata:
 4005d8 01000200 48656c6c 6f204576 65727962  ....Hello Everyb
 4005e8 6f64790a 00                          ody..

Using objdump --source gives us our source code alongside assembly instructions. In this example we can see that our source code is stored within a .text section and inside a main function.

  000000000040050c <main>:
  #include <stdio.h>
   
  int main(){
    40050c: 55                    push   %rbp
    40050d: 48 89 e5              mov    %rsp,%rbp
    printf("Hello Everybody\n");
    400510: bf dc 05 40 00        mov    $0x4005dc,%edi
    400515: b8 00 00 00 00        mov    $0x0,%eax
    40051a: e8 c1 fe ff ff        callq  4003e0 <printf@plt>
  }
    40051f: 5d                    pop    %rbp
    400520: c3                    retq
    400521: 90                    nop
    400522: 90                    nop
    400523: 90                    nop
    400524: 90                    nop

Second compile: gcc hello.c -g -O0 -fno-builtin -static -o step1

Adding -static.
Overall the section size and the source is much larger than the original because static adds all the linked code from various libraries in C.
As you can see below, after outputting the objdump results to files, the --source file is about 700 times larger and the section information is about 200 times larger than the original.

    9399 Feb 25 12:11 origin-src.txt
   10865 Feb 25 12:11 origin-s.txt
     142 Feb 25 12:14 step1-f.txt
 7103611 Feb 25 12:15 step1-src.txt
 2723639 Feb 25 12:14 step1-s.txt

Third compile: gcc hello.c -g -O0 -o step2

Removing -fno-builtin.

 000000000040050c :
 #include 
  
 int main(){
  40050c: 55                    push   %rbp
  40050d: 48 89 e5              mov    %rsp,%rbp
  printf("Hello Everybody\n");
  400510: bf cc 05 40 00        mov    $0x4005cc,%edi
  400515: e8 c6 fe ff ff        callq  4003e0 
 }
  40051a: 5d                    pop    %rbp
  40051b: c3                    retq

This changes the function call by removing some builtin function optimizations,for example instead of calling 'printf' it calls 'puts'

Fourth compile: gcc hello.c -O0 -fno-builtin -o step3

Removing -g

Contents of section .got.plt:                                 Contents of section .got.plt:
 6008c8 e0066000 00000000 00000000 00000000  ..`.............  6008c8 e0066000 00000000 00000000 00000000  ..`.............
 6008d8 00000000 00000000 e6034000 00000000  ..........@.....  6008d8 00000000 00000000 e6034000 00000000  ..........@.....
 6008e8 f6034000 00000000                    ..@.....          6008e8 f6034000 00000000                    ..@.....        
Contents of section .data:                                    Contents of section .data:
 6008f0 00000000 00000000 00000000 00000000  ................  6008f0 00000000 00000000 00000000 00000000  ........ ........
Contents of section .comment:                                 Contents of section .comment:
 0000 4743433a 20284465 6269616e 20342e37  GCC: (Debian 4.7    0000 4743433a 20284465 6269616e 20342e37  GCC: (Debian 4.7
 0010 2e322d35 2920342e 372e3200 4743433a  .2-5) 4.7.2.GCC:    0010 2e322d35 2920342e 372e3200 4743433a  .2-5) 4.7.2.GCC:
 0020 20284465 6269616e 20342e34 2e372d33   (Debian 4.4.7-3    0020 20284465 6269616e 20342e34 2e372d33   (Debian 4.4.7-3
 0030 2920342e 342e3700                    ) 4.4.7.            0030 2920342e 342e3700                    ) 4.4.7.        
Contents of section .debug_aranges:                           >
 0000 2c000000 02000000 00000800 00000000  ,...............   >
 0010 0c054000 00000000 15000000 00000000  ..@.............   >
 0020 00000000 00000000 00000000 00000000  ................   >
Contents of section .debug_info:                              >
 0000 91000000 02000000 00000801 53000000  ............S...   >
 0010 01690000 00120000 000c0540 00000000  .i.........@....   >
 0020 00210540 00000000 00000000 00020807  .!.@............   >
 0030 00000000 02010871 00000002 02074000  .......q......@.   >
 0040 00000204 07050000 00020106 73000000  ............s...   >
 0050 0202055f 00000003 0405696e 74000208  ..._......int...   >
 0060 057f0000 00020807 88000000 0201067a  ...............z   >
 0070 00000004 01910000 00010357 0000000c  ...........W....   >
 0080 05400000 00000021 05400000 00000000  .@.....!.@......   >
 0090 00000001 00                          .....              >
Contents of section .debug_abbrev:                            >
 0000 01110125 0e130b03 0e1b0e11 01120110  ...%............   >
 0010 06000002 24000b0b 3e0b030e 00000324  ....$...>......$   >

Using the linux command diff -y [filename1] [filename2] we can compare these two files, in the left file which is the original we can see that the section contents are filled with many sections prefixed by .debug, this is because the -g option adds debugging information. When we remove -g all the debugging information is gone.

Fifth compile: gcc hello2.c -g -O0 -fno-builtin -o step4

Changing C file to hello2.c which has additional arguments in printf.

000000000040050c :
#include 

int main(){
  40050c: 55                    push   %rbp
  40050d: 48 89 e5              mov    %rsp,%rbp
  400510: 48 83 ec 30           sub    $0x30,%rsp
  printf("Hello Everybody , %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n",1,2,3,4,5,6,7,8,9,10);
  400514: c7 44 24 20 0a 00 00  movl   $0xa,0x20(%rsp)
  40051b: 00 
  40051c: c7 44 24 18 09 00 00  movl   $0x9,0x18(%rsp)
  400523: 00 
  400524: c7 44 24 10 08 00 00  movl   $0x8,0x10(%rsp)
  40052b: 00 
  40052c: c7 44 24 08 07 00 00  movl   $0x7,0x8(%rsp)
  400533: 00 
  400534: c7 04 24 06 00 00 00  movl   $0x6,(%rsp)
  40053b: 41 b9 05 00 00 00     mov    $0x5,%r9d
  400541: 41 b8 04 00 00 00     mov    $0x4,%r8d
  400547: b9 03 00 00 00        mov    $0x3,%ecx
  40054c: ba 02 00 00 00        mov    $0x2,%edx
  400551: be 01 00 00 00        mov    $0x1,%esi
  400556: bf 20 06 40 00        mov    $0x400620,%edi
  40055b: b8 00 00 00 00        mov    $0x0,%eax
  400560: e8 7b fe ff ff        callq  4003e0 
}
  400565: c9                    leaveq 
  400566: c3                    retq   
  400567: 90                    nop

As you can see, the numbers 1 to 5 get placed in the registers esi, edx, ecx, r8d, r9d, but the rest use %rsp in an interesting way. Consider this line:

movl       $0x7,0x8(%rsp)

%rsp is the stack pointer, and using the brackets is called dereferencing, this basically means it moves the value 0x7 into the location which is 8 bytes away from the stack.

Sixth compile: gcc hello3.c -g -O0 -fno-builtin -o step5

We are now using hello3.c which moves the printf to another function named output. The code is still mostly similar but as expected it moves all the instructions related to printf outside main into a new function.

000000000040050c <output>:
#include <stdio.h>
void output(){
  40050c: 55                    push   %rbp
  40050d: 48 89 e5              mov    %rsp,%rbp
  400510: 48 83 ec 30           sub    $0x30,%rsp
  printf("Hello Everybody , %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n",1,2,3,4,5,6,7,8,9,10);
  400514: c7 44 24 20 0a 00 00  movl   $0xa,0x20(%rsp)
  40051b: 00 
  40051c: c7 44 24 18 09 00 00  movl   $0x9,0x18(%rsp)
  400523: 00 
  400524: c7 44 24 10 08 00 00  movl   $0x8,0x10(%rsp)
  40052b: 00 
  40052c: c7 44 24 08 07 00 00  movl   $0x7,0x8(%rsp)
  400533: 00 
  400534: c7 04 24 06 00 00 00  movl   $0x6,(%rsp)
  40053b: 41 b9 05 00 00 00     mov    $0x5,%r9d
  400541: 41 b8 04 00 00 00     mov    $0x4,%r8d
  400547: b9 03 00 00 00        mov    $0x3,%ecx
  40054c: ba 02 00 00 00        mov    $0x2,%edx
  400551: be 01 00 00 00        mov    $0x1,%esi
  400556: bf 30 06 40 00        mov    $0x400630,%edi
  40055b: b8 00 00 00 00        mov    $0x0,%eax
  400560: e8 7b fe ff ff        callq  4003e0 <printf@plt>
}
  400565: c9                    leaveq 
  400566: c3                    retq   

0000000000400567 <main>:
int main(){
  400567: 55                    push   %rbp
  400568: 48 89 e5              mov    %rsp,%rbp
  output();
  40056b: b8 00 00 00 00        mov    $0x0,%eax
  400570: e8 97 ff ff ff        callq  40050c <output>
}
  400575: 5d                    pop    %rbp
  400576: c3                    retq   
  400577: 90                    nop

Seventh compile: gcc hello3.c -g -O3 -fno-builtin -o step6

We are changing the optimization level from 0 to 3, this will change our code quite a lot and add optimizations.

0000000000400520 <output>:
#include <stdio.h>
void output(){
  400520: 48 83 ec 38           sub    $0x38,%rsp
  printf("Hello Everybody , %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n",1,2,3,4,5,6,7,8,9,10);
  400524: 41 b9 05 00 00 00     mov    $0x5,%r9d
  40052a: 41 b8 04 00 00 00     mov    $0x4,%r8d
  400530: c7 44 24 20 0a 00 00  movl   $0xa,0x20(%rsp)
  400537: 00 
  400538: c7 44 24 18 09 00 00  movl   $0x9,0x18(%rsp)
  40053f: 00 
  400540: b9 03 00 00 00        mov    $0x3,%ecx
  400545: c7 44 24 10 08 00 00  movl   $0x8,0x10(%rsp)
  40054c: 00 
  40054d: c7 44 24 08 07 00 00  movl   $0x7,0x8(%rsp)
  400554: 00 
  400555: ba 02 00 00 00        mov    $0x2,%edx
  40055a: c7 04 24 06 00 00 00  movl   $0x6,(%rsp)
  400561: be 01 00 00 00        mov    $0x1,%esi
  400566: bf 30 06 40 00        mov    $0x400630,%edi
  40056b: 31 c0                 xor    %eax,%eax
  40056d: e8 6e fe ff ff        callq  4003e0 <printf@plt>
}
  400572: 48 83 c4 38           add    $0x38,%rsp
  400576: c3                    retq

As you can see our program has changed a bit, the order of the mov statements has changed and different statements are being used. For example consider these 2 lines:

 xor %eax,%eax ;O3

 mov $0x0,%eax ; O0

Both of these statements are doing the same thing, moving the value of 0 into eax. When optimizations are turned on it uses xor on a register to get the value of 0, this is much faster because it does not have to move any data at all.
Note: when you xor any value to itself the result is 0.

Conclusions

The complexity of all the compiler options is massive and this post only scratches the surface of all the options available. This provides some insight into the amount of assembly code generated even in the most basic C programs. It gives you an idea how complex assembly can get especially if the C code was larger, it also allows you to have a glimpse behind the scenes of C.

Thanks for reading.

Sunday, 1 February 2015

Lab 3

Lab 3, Python Profiling

We decided to not continue using MySQL chose to download python. We thought it would be a good idea to test it on both servers, australia: 64 bit x86 and red: 64 bit Arm.
We used python 3.4 using this link and wget to get it onto the server. After we un-tarred it (tar -xvf [filename] ) we ran the ./configure which gave us a makefile and then we made a few changes.
We changed line 72 in the makefile which was
BASECFLAGS= -Wno-unused-result
to
BASECFLAGS= -Wno-unused-result -pg

This allows python to output a gmon.out file when python executes, this contains all the profiling information.
We now needed to run something with Python so we would have a valid gmon.out file.
we found this resource: tests which showed us how to run python tests using the command
./python -m test

On Australia:
This went through 389 tests when it was completed we had a gmon.out file
To make the profile easier to read our professor suggested to use gprof2dot to generate a png image file of the profile, the command is:
gprof ./python | gprof2dot | dot -Tpng > Profile.png (full image)

On Red:
This went through most of the tests but stalled at the last one with a message:
_mcleanup: gmon.out: No such file or directory
(I later realized that message popped up a lot during the tests which suggests that it is not relevant to the stalling)
After waiting a long time I just used control-C to get out of the test and I still had a gmon.out with 388 tests. Again, used gprof2dot to get the image file of the profile (full image)

Although both pictures appear very different I was looking at the function calls and they are the same, it is simply the layout of the picture that changes drastically.

There is one function that stands out on both systems is PyEval_EvalFrameEx, it takes up 12% on x86 and 14% on Arm. I looked up the source code here and the function itself is around 2500 lines , it is safe to say I'm not sure what it is doing. It does have an interesting comment section line 827 about optimizations that this function has, this is essentially trying to avoid leading the CPU on a mispredicted branch, this relates to what Chris Tyler was talking about in our class(on 1/29/2015) when he mentioned CPUs guessing the correct path to follow.

In the end I think red's profile is hard to directly compare to australia's because of the difference of layout of the 2 files but the EvalFrameEx takes up the most usage on both systems. It seems like they are aware of this with the extensive optimization of it.

Thanks for reading

Lab 2

Lab 2, Benchmarking Mysql

For lab 2 we had to benchmark a software package. We chose MySql and found the link here and under the selected platform just choose Source Code. We were using a 64 bit x86 machine running fedora 21 and there was not a version for fedora so we just chose generic Linux option which is at the very bottom. We copied that link and used the wget command to install it via command line. Once we Un-tarred it (tar -xvf [filename]) we started looking for the configure script but it did not have one, it used cmake, not configure. So we installed cmake on the server and ran it and it gave us a makefile. We ran make without the useful -j option which allows it to take advantage of multiple cores, so that delayed us, we ending up waiting for a long time. Once it was made we needed to benchmark it, eventually we found out how to use the benchmark suite through this tutorial. We were running low on time in class so we stopped it after the first test but it gave us this results file:

per operation:
Operation                           seconds    usr       sys     cpu   tests
alter_table_add                      100.00    0.02    0.00    0.02     100
alter_table_drop                     101.00    0.01    0.00    0.01      91
create_index                           2.00        0.00    0.00    0.00       8
create_table                            11.00      0.00    0.00    0.00      28
drop_index                             2.00        0.00    0.00    0.00       8
drop_table                              14.00      0.00    0.00    0.00      28
insert                                 340.00    0.37    0.20    0.57    9768
select_distinct                          2.00        0.19    0.00    0.19     800
select_group                           1.00         0.24    0.02    0.26    2800
select_join                           0.00      0.06    0.00    0.06     100
select_key_prefix_join             1.00        0.34    0.00    0.34     100
select_simple_join                    1.00      0.08    0.00    0.08     500
TOTALS                               575.00     1.31    0.22    1.53   14331

I'm not quite sure if I understand this table, particularly why there is so many seconds but there was so little time taken on usr or sys side. My theory is that the usr and sys and cpu columns are the percent of cpu it was using at the time, but I could be wrong.

I think this lab was quite useful, we learned how to use cmake and how to speed up the make process with -j and we did get results, and although I might not understand them yet, they're still results!

Thanks for reading.