Summary: This module introduces the interactive optimizing compiler combination VPO/VISTA. VPO (Very Portable Optimizer) is a retargetable compiler developed at the University of Virginia. VISTA (VPO Interactive System for Tuning Applications), developed at Florida State University, is a GUI that allows a user to interactively select optimization steps with VPO feedback. Instructions are given on how to apply VPO/VISTA optimization to three routines provided in the power spectral density (PSD) estimator of the previous lab. Students are then asked to optimize these routines twice: once using VISTA and once using any method desired.
In this lab you will apply an interactive optimizing compiler to the IIR filtering and autocorrelation routines of the previous lab. You will also optimize the pseudo-noise (PN) generation, IIR filtering, and autocorrelation routines, using any technique desired, to get code that executes as fast as possible.
Limitations of Optimizing Compilers
Most code optimizations are done on an intermediate representation of the source code or at the assembly level (post-pass optimization). Read the Wikipedia [link] article on compiler optimization [link] and learn about optimization problems, types, factors, and techniques. Remember, no compiler optimization can effectively overcome a poor algorithm design or implementation.
Since compilers group transformations into phases, the user cannot choose exactly which transformations are performed nor can the user specify the order in which they are performed. Furthermore, the user cannot see the results of individual transformations. A particularly unfortunate case is when only one transformation in a particular phase causes buggy code to be produced. Designing compilers to overcome some of these limitations is an active area of research. One such optimizer and compiler that is being developed is VPO/VISTA.
VPO: Very Portable Optimizer
VPO is a compiler-independent and language-independent optimizer. [1]
"VPO employs a paradigm of compilation that has proven to be flexible and adaptable - all code improving transformations are performed on a single target-specific representation of the program, called RTLs [link] (Register Transfer Lists). RTLs are a low-level, machine and language independent representation that encodes machine-specific instructions. The comprehensive use of RTLs in VPO has several important consequences. One advantage of using RTLs as the sole intermediate representation is that the synthesis phases of the compiler can be invoked in any order and repeatedly if necessary. This largely eliminates code inefficiencies often caused by the ordering of the phases. In contrast, a more conventional compiler system will perform optimizations on various different representations."[2]VPO is also easily retargeted to new architectures and it can be extended to accomodate new architectural features.
VISTA: VPO Interactive System for Tuning Applications
VISTA is a front end to VPO that enables the user to optimize the code interactively. One view displays assembly or RTL instructions and another view takes input on which optimizations to perform. Each optimization phase and transformation can be stepped through to see which which instructions are modified.
Now we will apply the interactive research compiler, VPO/VISTA, to the autocorrelation routine. VPO runs on Solaris, while VISTA runs in Java. We will run Java from a local windows machine since Code Composer runs on Windows. After compilation into an internal representation is completed, Java is used to start a "code server" on a Solaris machine over a network port, and a Java client is run locally to interactively optimize the code. The port over which the connection is made changes each time the optimizer is run. Your TA will give you instructions on which machine to connect to and your account information. You will transfer
iirfilt.c
, autocorr.c
, and required header files to the remote machine. You will then compile autocorr.c
and apply a a set of VISTA optimization sequences to the code. Next, you will then transfer autocorr.vpo.asm
to the local machine and assemble it with the rest of PSD estimation application. Once you understand the VISTA optimization process and can interpret the results, you will use VISTA to fully optimize iirfilt.c
.Transfer necessary files to remote machine
First, change the value of
M
in your local copy of lab4b.h
to 23
. In Windows, open a "WinSCP" file browser by clicking Start
->All Programs
->WinSCP
. Log in to the remote Solaris machine, make a working directory on the remote machine (e.g. lab5
), and transfer your local copies of iirfilt.c
, autocorr.c
, lab4b.h
, and intrinsics.h
to the remote machine.Open an "SSH Secure Shell" terminal and compile autocorr.c remotely
Now open a secure terminal by selecting
Start
->All Programs
->Putty
. Login in and change the working directory to the one you placed your files in. Type vpocompile autocorr.c
to compile the file into autocorr.cex
. This new file contains RTLs that can be optimized by VISTA.Serve autocorr.cex remotely
If VISTA has been previously run on this code and you do not want any information from the last VISTA session kept, delete the files named
autocorr.trans
andautocorr.transout
by typing rm autocorr.trans*
in the SSH terminal. Now the RTL file must be served from the remote machine to the local Windows machine by typingvposerver autocorr.cex
in the SSH terminal. Note that the exact command to run VISTA (see next step) is printed on the screen, and it can be copied into the local command prompt.Run VISTA locally at a command prompt
Run VISTA on the local Windows machine by typing
vista_viewer port_number hostname
at a command prompt. The port number and hostname is given in the output of the vposerver
command.Optimize autocorr.c
When VISTA is run a JAVA GUI will be opened. The right half of the GUI contains the assembly instructions, represented in the VPO-internal register-transfer-list (RTL) format. The left half contains the list of phases. Each phase contains multiple transformations; the "CD-player-like" buttons allow the user to move to the beginning and end of the entire phase list ( |<>| ), move backward and forward one phase ( <<>> ), and move backward and forward one transformation ( <> ). The buttons directly below the phase list allow phases to be added to the list.
First, select
Ctl Flow(square)
from the drop-down box. Can you identify what each loop/branch (depicted via arrows) corresponds to in the source code?To see how the RTL instructions originally listed in the right half of the GUI relate to assembly instructions, select
Assembly
in the drop-down box to see the corresponding assembly instructions. Note that the accumulators and registers are labeled with increasing numbers. These labels help indicate to the optimizer when a register is "dead", meaning that its contents are no longer needed. These "registers" can be mapped to physical registers by selecting Setup Trans Sequence
, Register Assignment
, and Done
. Note that after Register Assignment
is selected, Eval Order Deter
and Global Inst Select
are no longer available and others are made available. After Done
is selected, you will then see familiar registers like A
, B
, AR1
, etc. Next to "Register Assignment" in the left half of the window are the number of transformations and the code size in percent of the original code.Now step through the individual register assigments. The > button will step through each transformation in two steps: pressing it once will highlight in yellow the instructions that may be modified (before-transformation state), and pressing it again will highlight the new instructions that replace the previous ones in light red (after-transformation state). Click on the <<> and observe the results. Stepping through transformations in
Inst Selection
is especially informative. Note the code size, then go back and try the disabled optimizations. First, click |< class="codeline" style="font-family: courier, 'courier new', monospace; font-size: 1.18em; color: rgb(0, 102, 0); font-style: normal; ">Option-> Discard changes after current state
to remove Register Assignment
. Now enter these optimizations: Eval Order Deter
, Global Inst Select
, and Register Assignment
. Note the improvement in code size. What happens if Global Inst Select
is instead selected first?Now remove all optimizations and enter the optimizations shown in figure 3. Note that
Fix Entry Exit
and Inst Scheduling
are required before assembly code can be written. If Fix Entry Exit
is left out, no code is generated to preserve and restore register states at the beginning and end of the function. Inst Scheduling
orders the instructions to avoid pipeline conflicts. If this optimization is left out, nops
are entered after each instruction. Note that even though Inst Scheduling
increases the reported size of the code, if it is left out the assembly code that VISTA writes to file will be even larger.To finish the VISTA session and use the optimized code, click the
Exit
button in the lower portion of the GUI to generate a vpo.asm
file that can be linked with the program. If you click on the "X" in the right side of the title bar, a vpo.asm
file will not be generated.Transfer autocorr.vpo.asm to local machine
Now the optimized file is ready to be linked to the rest of the PSD estimator. Using the SSH file browser, transfer
autocorr.vpo.asm
to the directory containing the PSD files on the local Windows machine.Compile, link, and test code locally at a command prompt
On the local windows machine, rename
autocorr.vpo.asm
to autocorrvpo.asm
since the Windows version of the TI compiler doesn't handle multiple periods in a filename properly. Compile and link the PSD estimator with the optimized autocorr.c
by typing c_asm lab4bmain.c pn.c iirfilt.c autocorrvpo.asm c_fft_given_iirc.asm
at the Windows command prompt. Load this executable onto the DSP and measure the number of clock cycles autocorr()
consumes. Is this code correct? Is it executing before the next buffer of N samples is filled? How can you tell?Optimize pn.c, iirfilt.c, and autocorr.c
After inspecting the source code of the PSD estimator, you will likely discover inefficiencies. This implementation is provided as the "reference implementation" of the optimization process and to define the expected input and output of the application. The computational efficiency of your code will be judged against this implementation. Since a primary purpose of this lab is to learn optimization and efficient code techniques, the optimization portion of your lab grade will be based on the total execution time of
rand_fillbuffer
, iirfilt
, and autocorr
, which are contained in pn.c
, iirfilt.c
, and autocorr.c
, respectively. You are not required to optimize memory use. You will not change lab4bmain.c
, and you will preserve the input and output behavior of each function. Note that by execution time we mean cycle count, not the number of instructions in your program. Remember that several of the TMS320C54xx instructions take more than one cycle. The multicycle instructions are primarily the multi-word instructions, including instructions that take immediates, like stm
, and instructions using direct addressing of memory (such as ld *(temp),A
). Branch and repeat statements also require several cycles to execute. Most C instructions take more than one cycle. The debugger can be used to determine the exact number of cycles used by your code; ask your TA to demonstrate. However, since the number of execution cycles used by an instruction is usually determined by the number of words in its encoding, the easiest way to estimate the number of cycles used by your code is to count the number of instruction words in the .lst
file or the disassembly window in the debugger.You will create an optimized version of this code using any technique desired, and you will create an optimized version of
iirfilt
with VISTA.VISTA optimization
Optimize
iirfilt
in iirfilt.c
. If VISTA is applied to other functions, note that a separate VISTA session must be run for each file. Also note that the reference implementation ofpn.c
contains two functions; after the first function has been optimized, and Fix Entry Exit
and Inst Scheduling
have been selected, select Option
-> Proceed to next function
to optimize the second function.Descriptions of many of the optimizations can be found in Appendix A. Some general guidelines for optimization:
- Make use of optimizations listed at or above
Register Assignment
before selectingRegister Assignment
- Some optimizations can be effectively used many times throught the list, such as
Inst Selection
,Dead Variable Elim
, andCommon Subexpr Elim
- De-optimizing phases such as
Deopt Reg Allocation
can be used to open up new optimization possibilities for future phases - You may skip the programming constructs listed at and below
if changes then
- You should be able to optimize the reference implementation of
iirfilt
to 25% (or less) of the original beforeFix Entry Exit
andInst Scheduling
are selected. - Some optimization sequences may produce buggy code. When testing each
.vpo.asm
file for correctness, do not link any other VISTA-generated files in the executable. For example, if you are testingautocorr.vpo.asm
, passpn.c
andiirfilt.c
through the TI compiler. Another reason the functions may not work properly is because it cannot keep up with the samples coming in. For some of the provided functions, any optimized version that is above 40% the original size is almost surely too large to execute quickly enough. - Each time a VISTA is exited when
Fix Entry Exit
andInst Scheduling
are selected, a.vpo.asm
file is produced. To run VISTA again on the same function requiresrm
to be executed before.trans* vposerver
is run again. - Before you begin entering transformations, select
start writing in
to start recording to a local file with the name that has been entered into the edit box next to this button. When you have completed the transformations that you want to record, select the record button again to stop recording. When you want to execute transformations from a file, enter the filename into the edit box and selectexecute from file
. Recording transformations can save a significant amount of time when you want to try variations of the same transformation sequence.
If you spot inefficiencies with a function's C code before you have tried optimizing it VISTA, or if you are not satisfied with your best reference-implementation optimization using VISTA, you may change the C code and re-optimize with VISTA. Recall that optimizing in C is a step before optimizing with hand-coded assembly. Look carefully at
iirfilt.c
and included files before optimizing. Some improvements may be made simply by modifying the C code. Modifications are valid as long as the modified code takes a "live" input and producesexactly the same output as the reference iirfilt
code. Note that hand-optimizing the VISTA-generated assembly code counts as "freestyle-optimized" code, not VISTA-only-optimized code. Be sure to write down your best VISTA optimization sequence and be prepared to show it to the TA. Restart VISTA and check to see that the sequence gives the same assembly code and reduction in code size percentage.While you are optimizing with VISTA, step through some transformations and figure out what they do, especially
Inst Selection
, Common Subexpr Elim
, Dead Variable Elimination
, Register Allocation
, Merge Basic Blocks
, and Elim Empty Blocks
. Also, correlate the VISTA blocks and assembly instructions with the C code. What does each block correspond to, and what part of the C code are the assembly instructions implementing? You may find it useful to look for key instructions like ones that perform multiplication."Free-style" optimization
For this optimization activity you can use any technique desired. While the reference code might serve as a starting point, you should do whatever you need to do to make your code as efficient as possible, while operating in an equivalent manner as the given code. Be sure to review the steps involved in optimization. You may use VISTA-generated assembly code as a starting point or for general guidance. Similarly, you can use TI-generated assembly code or VPO-only code. To generate VPO-only code, use the command
vpocompile2
on the remote machine. This command will run VPO with a fixed optimization sequence. A .s
assembly file will be created that can be transferred to the local machine and assembled with the rest of the PSD estimator. TI-generated assembly code is produced for each file that is passed to c_asm
. While you are optimizing the code in a "free-style" manner, be sure to note what optimizing compilers can and can't do in optimizing code.Routine-Specific Optimization Tips
If you are programming the PN generator in assembly, you may wish to refer to the description of assembly instructions for logical operations in Section 2-2 of the C54x Mnemonic Instruction Set reference. Initialize the shift register to one. You can debug the PN output by comparing it to the output of the MATLAB code. Be prepared, as part of your quiz, to prove to a TA that your PN generator works properly.
Your IIR filtering routine can debugged by placing an impulse followed by zeros in
autocorr_in
instead of a PN sequence.Your autocorrelation routine can be debugged by commenting out the IIR-filtering routine and placing the maximum DC value into
autocorr_in
in a similar manner as described the IIR-debugging step. Note that each of these tips is the most useful if the output is inspected in memory.Grading
We will grade you based on the number of cycles used in the statements
rand_fillbuffer();
, iirfilt();
, and autocorr();
. Note that some instructions, like RPT
, are non-repeatable instructions; their use may cause unnecessary glitches in I/O. For grading simplicity, your final code should not have modifications except in these three functions, and M
should be set to 23
. If the number of cycles used in each function is variable, the maximum possible number of cycles will be counted for each function. You must use the core.asm
file inv:\ece320\54x\dsplib\core.asm
or the C core file in v:\ece320\54x\dspclib\core.asm
as provided by the TAs; these files may not be modified. We reserve the right to test your code by modifying the inputs.You are also required to record some of your
iirfilt
VISTA results to a file so that improvements to VISTA can be identified. Each time you test a VISTA-generated function on the DSP, (1) record (yes or no) whether it works, and (2) how many cycles it takes to execute this function. You are not required to test every sequence you generate; if you are not happy with the code size of a sequence, keep trying until you have one worthy of testing. Once you test the code, record the results. When you have finished optimizing a function with VISTA, (1) write down a rough estimate of the total time you spent optimizing the function with VISTA (to the nearest 15 minutes or less), and (2) write down the VISTA optimization sequence that can be used to obtain your optimized code. Finally, record at what stage, if any, you altered the C code of the reference implementation. Valid answers are not at all, before VISTA was used, or after some VISTA optimization was attempted.Record how much time you spent on the "free-style" optimization activity, as well as whether you used generated code as a starting point or "started from scratch". If you used VISTA, VPO, or TI generated code only for guidance this is to be recorded as "starting from scratch" or "None" in the free-style results table. If you modified a generated assembly file, record this as "VISTA", "VPO", or "TI" in the free-style results table.
Email your results at or before the time they are due. The address will be given to you by your TA. Enter in the subject line "[VISTA]", without the quotes and a space where appears. After the space you may put anything you want. Email your C code used in optimization (if it is different from the reference implementation), your free-style-optimized assembly code, and the results of your optimization process. Please attach the results as text; if you used a spreadsheet to record the results, export the file to a CSV file and attach the CSV file. Optionally, add any comments you might have about your experience with VISTA. Here are two tables listing what is required:
VISTA Optimization | Optimized function | |
---|---|---|
iirfilt() | ||
Iteration | Works? | Cycle time |
1 | ||
2 | ||
3 | ||
... | ||
Best VISTA | Yes | |
Best VISTA Optimization Sequence | ||
Time spent using VISTA | ||
When was C code altered? |
"Free-style" optimization | Optimized function | |
---|---|---|
iirfilt() | ||
Time Spent Optimizing | Cycle time | |
Best Free-style | ||
Which generated code modified? |
This lab is to be completed in two weeks, but the VISTA portion is to be completed in one week and the VISTA log should be sent in at that time. Grading for this lab will be a bit different from past labs:
- 1 point: Working VISTA version of
iirfilt
, due the first week. - 2 points: Cycle count of the
iirfilt
version optimized with VISTA, due the first week. - 1 point: Completing and emailing your code and "optimization logs", due the first week (VISTA C code and log) and second week (free-style code and log).
- 1 point: Working versions of the three free-style-optimized routines, due the second week.
- 4 points: Cycle counts of the three routines optimized in the free-style exercise, due the second week.
- 1 points: Quiz on VISTA during the second week.
Appendix A: Descriptions of some optimization phases
Optimization Phase | Description |
---|---|
branch chaining | Replaces a branch or jump target with the target of the last jump in the chain |
eliminate empty block | Removes empty blocks from the control flow graph |
useless jump elimination | Remove useless transfers of control like a jump to the next block in the control flow |
dead code elimination | Remove block unreachable from the top block |
reverse branches | Reverses a conditional branch when it branches over a jump to eliminate the jump |
block reordering | Removes a jump by reordering basic blocks when the target of the jump has only a single predecessor |
merge basic blocks | Merges two consecutive basic blocks when the predecessor has no transfer of control and the successor has only one predecessor |
instruction selection | Combine instructions together when the combined effect is in a legal instruction |
evaluation order determination | Reorder instructions within a basic block to calculate expressions that require the most registers first |
minimize loop jumps | Remove an unconditional jump at the end of a loop or one that jumps into a loop, by replicating a portion of the loop |
dead assignment elimination | Removes assignments where the assignment value is never used |
register allocation | Replaces references to a variable within a specific live range with a register |
common subexpression elimination | Eliminates fully redundant calculations |
loop transformations | Performs loop-invariant code motion, recurrence elimination, loop strength reduction, and induction variable elimination on each loop ordered by loop nesting level. Each of these transformations can also be individually selected by the user. |
strength reduction | Replaces an expensive instruction with one or more cheaper ones |
REFERENCES
- Zephyr Investigators. Machine-independent optimization. [http://www.cs.virginia.edu/zephyr/vpo].
- P. Kulkarni. (2003, August). Performance Driven Optimization Tuning in VISTA. [http://www.cs.fsu.edu/~whalley/papers/kulkarni_thesis03.ps]. Florida State University.
- W. Zhao, B. Cai, D. Whalley, M. Bailey, R. van Engelen, X. Yuan, J. Hiser, J. Davidson, K. Gallivan, and D. Jones. (2002, June). VISTA: A System for Interactive Code Improvement. InProceedings of the ACM SIGPLAN Conference on Language, Compilers, and Tools for Embedded Systems. (pp. 155-164). [http://www.cs.fsu.edu/~whalley/papers/lctes02.ps].
- M. R. Boyd; D. B. Whalley. (1995, June). Graphical Visualization of Compiler Optimizations. [http://www.cs.fsu.edu/~whalley/papers/jopl95.ps]. Journal of Programming Languages, 69-94.
0 comments:
Post a Comment