Spectrum Analyzer: VPO/VISTA Optimization

Summary: This module introduces the interactive optimizing compiler combination VPO/VISTA. VPO (Very Portable Optimizer) is a retargetable compiler developed at the University of Virginia. VISTA (VPO Interactive System for Tuning Applications), developed at Florida State University, is a GUI that allows a user to interactively select optimization steps with VPO feedback. Instructions are given on how to apply VPO/VISTA optimization to three routines provided in the power spectral density (PSD) estimator of the previous lab. Students are then asked to optimize these routines twice: once using VISTA and once using any method desired.

In this lab you will apply an interactive optimizing compiler to the IIR filtering and autocorrelation routines of the previous lab. You will also optimize the pseudo-noise (PN) generation, IIR filtering, and autocorrelation routines, using any technique desired, to get code that executes as fast as possible.

Limitations of Optimizing Compilers

Most code optimizations are done on an intermediate representation of the source code or at the assembly level (post-pass optimization). Read the Wikipedia [link] article on compiler optimization [link] and learn about optimization problems, types, factors, and techniques. Remember, no compiler optimization can effectively overcome a poor algorithm design or implementation.

Since compilers group transformations into phases, the user cannot choose exactly which transformations are performed nor can the user specify the order in which they are performed. Furthermore, the user cannot see the results of individual transformations. A particularly unfortunate case is when only one transformation in a particular phase causes buggy code to be produced. Designing compilers to overcome some of these limitations is an active area of research. One such optimizer and compiler that is being developed is VPO/VISTA.

VPO: Very Portable Optimizer

VPO is a compiler-independent and language-independent optimizer. [1]

"VPO employs a paradigm of compilation that has proven to be flexible and adaptable - all code improving transformations are performed on a single target-specific representation of the program, called RTLs [link] (Register Transfer Lists). RTLs are a low-level, machine and language independent representation that encodes machine-specific instructions. The comprehensive use of RTLs in VPO has several important consequences. One advantage of using RTLs as the sole intermediate representation is that the synthesis phases of the compiler can be invoked in any order and repeatedly if necessary. This largely eliminates code inefficiencies often caused by the ordering of the phases. In contrast, a more conventional compiler system will perform optimizations on various different representations."[2]

VPO is also easily retargeted to new architectures and it can be extended to accomodate new architectural features.

VISTA: VPO Interactive System for Tuning Applications

VISTA is a front end to VPO that enables the user to optimize the code interactively. One view displays assembly or RTL instructions and another view takes input on which optimizations to perform. Each optimization phase and transformation can be stepped through to see which which instructions are modified.

Now we will apply the interactive research compiler, VPO/VISTA, to the autocorrelation routine. VPO runs on Solaris, while VISTA runs in Java. We will run Java from a local windows machine since Code Composer runs on Windows. After compilation into an internal representation is completed, Java is used to start a "code server" on a Solaris machine over a network port, and a Java client is run locally to interactively optimize the code. The port over which the connection is made changes each time the optimizer is run. Your TA will give you instructions on which machine to connect to and your account information. You will transfer iirfilt.c, autocorr.c, and required header files to the remote machine. You will then compile autocorr.cand apply a a set of VISTA optimization sequences to the code. Next, you will then transfer autocorr.vpo.asm to the local machine and assemble it with the rest of PSD estimation application. Once you understand the VISTA optimization process and can interpret the results, you will use VISTA to fully optimize iirfilt.c.

Transfer necessary files to remote machine

First, change the value of M in your local copy of lab4b.h to 23. In Windows, open a "WinSCP" file browser by clicking Start->All Programs->WinSCP. Log in to the remote Solaris machine, make a working directory on the remote machine (e.g. lab5), and transfer your local copies of iirfilt.c, autocorr.c, lab4b.h, and intrinsics.h to the remote machine.

Open an "SSH Secure Shell" terminal and compile autocorr.c remotely

Now open a secure terminal by selecting Start->All Programs->Putty. Login in and change the working directory to the one you placed your files in. Type vpocompile autocorr.c to compile the file into autocorr.cex. This new file contains RTLs that can be optimized by VISTA.

Serve autocorr.cex remotely

If VISTA has been previously run on this code and you do not want any information from the last VISTA session kept, delete the files named autocorr.trans andautocorr.transout by typing rm autocorr.trans* in the SSH terminal. Now the RTL file must be served from the remote machine to the local Windows machine by typingvposerver autocorr.cex in the SSH terminal. Note that the exact command to run VISTA (see next step) is printed on the screen, and it can be copied into the local command prompt.

Run VISTA locally at a command prompt

Run VISTA on the local Windows machine by typing vista_viewer port_number hostname at a command prompt. The port number and hostname is given in the output of the vposerver command.

Optimize autocorr.c

When VISTA is run a JAVA GUI will be opened. The right half of the GUI contains the assembly instructions, represented in the VPO-internal register-transfer-list (RTL) format. The left half contains the list of phases. Each phase contains multiple transformations; the "CD-player-like" buttons allow the user to move to the beginning and end of the entire phase list ( |<>| ), move backward and forward one phase ( <<>> ), and move backward and forward one transformation ( <> ). The buttons directly below the phase list allow phases to be added to the list.

**Figure 1:** Initial VISTA view with `Ctl Flow(square)` selected

First, select Ctl Flow(square) from the drop-down box. Can you identify what each loop/branch (depicted via arrows) corresponds to in the source code?

**Figure 2:** Initial VISTA view with instructions displayed in assembly form

To see how the RTL instructions originally listed in the right half of the GUI relate to assembly instructions, select Assembly in the drop-down box to see the corresponding assembly instructions. Note that the accumulators and registers are labeled with increasing numbers. These labels help indicate to the optimizer when a register is "dead", meaning that its contents are no longer needed. These "registers" can be mapped to physical registers by selecting Setup Trans Sequence, Register Assignment, and Done. Note that after Register Assignment is selected, Eval Order Deter and Global Inst Select are no longer available and others are made available. After Done is selected, you will then see familiar registers like A, B, AR1, etc. Next to "Register Assignment" in the left half of the window are the number of transformations and the code size in percent of the original code.

Now step through the individual register assigments. The > button will step through each transformation in two steps: pressing it once will highlight in yellow the instructions that may be modified (before-transformation state), and pressing it again will highlight the new instructions that replace the previous ones in light red (after-transformation state). Click on the <<> and observe the results. Stepping through transformations in Inst Selection is especially informative. Note the code size, then go back and try the disabled optimizations. First, click |< class="codeline" style="font-family: courier, 'courier new', monospace; font-size: 1.18em; color: rgb(0, 102, 0); font-style: normal; ">Option-> Discard changes after current state to remove Register Assignment. Now enter these optimizations: Eval Order Deter, Global Inst Select, and Register Assignment. Note the improvement in code size. What happens if Global Inst Select is instead selected first?

**Figure 3:** VISTA optimization phase entry

Now remove all optimizations and enter the optimizations shown in figure 3. Note that Fix Entry Exit and Inst Scheduling are required before assembly code can be written. If Fix Entry Exit is left out, no code is generated to preserve and restore register states at the beginning and end of the function. Inst Scheduling orders the instructions to avoid pipeline conflicts. If this optimization is left out, nops are entered after each instruction. Note that even though Inst Scheduling increases the reported size of the code, if it is left out the assembly code that VISTA writes to file will be even larger.

**Figure 4:** VISTA view before a particular transformation is applied

To finish the VISTA session and use the optimized code, click the Exit button in the lower portion of the GUI to generate a vpo.asm file that can be linked with the program. If you click on the "X" in the right side of the title bar, a vpo.asm file will not be generated.

Transfer autocorr.vpo.asm to local machine

Now the optimized file is ready to be linked to the rest of the PSD estimator. Using the SSH file browser, transfer autocorr.vpo.asm to the directory containing the PSD files on the local Windows machine.

Compile, link, and test code locally at a command prompt

On the local windows machine, rename autocorr.vpo.asm to autocorrvpo.asm since the Windows version of the TI compiler doesn't handle multiple periods in a filename properly. Compile and link the PSD estimator with the optimized autocorr.c by typing c_asm lab4bmain.c pn.c iirfilt.c autocorrvpo.asm c_fft_given_iirc.asm at the Windows command prompt. Load this executable onto the DSP and measure the number of clock cycles autocorr() consumes. Is this code correct? Is it executing before the next buffer of N samples is filled? How can you tell?

Optimize pn.c, iirfilt.c, and autocorr.c

After inspecting the source code of the PSD estimator, you will likely discover inefficiencies. This implementation is provided as the "reference implementation" of the optimization process and to define the expected input and output of the application. The computational efficiency of your code will be judged against this implementation. Since a primary purpose of this lab is to learn optimization and efficient code techniques, the optimization portion of your lab grade will be based on the total execution time of rand_fillbuffer, iirfilt, and autocorr, which are contained in pn.c, iirfilt.c, and autocorr.c, respectively. You are not required to optimize memory use. You will not change lab4bmain.c, and you will preserve the input and output behavior of each function. Note that by execution time we mean cycle count, not the number of instructions in your program. Remember that several of the TMS320C54xx instructions take more than one cycle. The multicycle instructions are primarily the multi-word instructions, including instructions that take immediates, like stm, and instructions using direct addressing of memory (such as ld *(temp),A). Branch and repeat statements also require several cycles to execute. Most C instructions take more than one cycle. The debugger can be used to determine the exact number of cycles used by your code; ask your TA to demonstrate. However, since the number of execution cycles used by an instruction is usually determined by the number of words in its encoding, the easiest way to estimate the number of cycles used by your code is to count the number of instruction words in the .lst file or the disassembly window in the debugger.

You will create an optimized version of this code using any technique desired, and you will create an optimized version of iirfilt with VISTA.

VISTA optimization

Optimize iirfilt in iirfilt.c. If VISTA is applied to other functions, note that a separate VISTA session must be run for each file. Also note that the reference implementation ofpn.c contains two functions; after the first function has been optimized, and Fix Entry Exit and Inst Scheduling have been selected, select Option-> Proceed to next function to optimize the second function.

Descriptions of many of the optimizations can be found in Appendix A. Some general guidelines for optimization:

Make use of optimizations listed at or above Register Assignment before selecting Register Assignment

Some optimizations can be effectively used many times throught the list, such as Inst Selection, Dead Variable Elim, and Common Subexpr Elim

De-optimizing phases such as Deopt Reg Allocation can be used to open up new optimization possibilities for future phases

You may skip the programming constructs listed at and below if changes then

You should be able to optimize the reference implementation of iirfilt to 25% (or less) of the original before Fix Entry Exit and Inst Scheduling are selected.

Some optimization sequences may produce buggy code. When testing each .vpo.asm file for correctness, do not link any other VISTA-generated files in the executable. For example, if you are testing autocorr.vpo.asm, pass pn.c and iirfilt.c through the TI compiler. Another reason the functions may not work properly is because it cannot keep up with the samples coming in. For some of the provided functions, any optimized version that is above 40% the original size is almost surely too large to execute quickly enough.

Each time a VISTA is exited when Fix Entry Exit and Inst Scheduling are selected, a .vpo.asm file is produced. To run VISTA again on the same function requiresrm .trans* to be executed before vposerver is run again.

Before you begin entering transformations, select start writing in to start recording to a local file with the name that has been entered into the edit box next to this button. When you have completed the transformations that you want to record, select the record button again to stop recording. When you want to execute transformations from a file, enter the filename into the edit box and select execute from file. Recording transformations can save a significant amount of time when you want to try variations of the same transformation sequence.

If you spot inefficiencies with a function's C code before you have tried optimizing it VISTA, or if you are not satisfied with your best reference-implementation optimization using VISTA, you may change the C code and re-optimize with VISTA. Recall that optimizing in C is a step before optimizing with hand-coded assembly. Look carefully at iirfilt.c and included files before optimizing. Some improvements may be made simply by modifying the C code. Modifications are valid as long as the modified code takes a "live" input and producesexactly the same output as the reference iirfilt code. Note that hand-optimizing the VISTA-generated assembly code counts as "freestyle-optimized" code, not VISTA-only-optimized code. Be sure to write down your best VISTA optimization sequence and be prepared to show it to the TA. Restart VISTA and check to see that the sequence gives the same assembly code and reduction in code size percentage.

While you are optimizing with VISTA, step through some transformations and figure out what they do, especially Inst Selection, Common Subexpr Elim, Dead Variable Elimination, Register Allocation, Merge Basic Blocks, and Elim Empty Blocks. Also, correlate the VISTA blocks and assembly instructions with the C code. What does each block correspond to, and what part of the C code are the assembly instructions implementing? You may find it useful to look for key instructions like ones that perform multiplication.

"Free-style" optimization

For this optimization activity you can use any technique desired. While the reference code might serve as a starting point, you should do whatever you need to do to make your code as efficient as possible, while operating in an equivalent manner as the given code. Be sure to review the steps involved in optimization. You may use VISTA-generated assembly code as a starting point or for general guidance. Similarly, you can use TI-generated assembly code or VPO-only code. To generate VPO-only code, use the command vpocompile2 on the remote machine. This command will run VPO with a fixed optimization sequence. A .s assembly file will be created that can be transferred to the local machine and assembled with the rest of the PSD estimator. TI-generated assembly code is produced for each file that is passed to c_asm. While you are optimizing the code in a "free-style" manner, be sure to note what optimizing compilers can and can't do in optimizing code.

Routine-Specific Optimization Tips

If you are programming the PN generator in assembly, you may wish to refer to the description of assembly instructions for logical operations in Section 2-2 of the C54x Mnemonic Instruction Set reference. Initialize the shift register to one. You can debug the PN output by comparing it to the output of the MATLAB code. Be prepared, as part of your quiz, to prove to a TA that your PN generator works properly.

Your IIR filtering routine can debugged by placing an impulse followed by zeros in autocorr_in instead of a PN sequence.

Your autocorrelation routine can be debugged by commenting out the IIR-filtering routine and placing the maximum DC value into autocorr_in in a similar manner as described the IIR-debugging step. Note that each of these tips is the most useful if the output is inspected in memory.

Grading

We will grade you based on the number of cycles used in the statements rand_fillbuffer();, iirfilt();, and autocorr();. Note that some instructions, like RPT, are non-repeatable instructions; their use may cause unnecessary glitches in I/O. For grading simplicity, your final code should not have modifications except in these three functions, and M should be set to 23. If the number of cycles used in each function is variable, the maximum possible number of cycles will be counted for each function. You must use the core.asm file inv:\ece320\54x\dsplib\core.asm or the C core file in v:\ece320\54x\dspclib\core.asm as provided by the TAs; these files may not be modified. We reserve the right to test your code by modifying the inputs.

You are also required to record some of your iirfilt VISTA results to a file so that improvements to VISTA can be identified. Each time you test a VISTA-generated function on the DSP, (1) record (yes or no) whether it works, and (2) how many cycles it takes to execute this function. You are not required to test every sequence you generate; if you are not happy with the code size of a sequence, keep trying until you have one worthy of testing. Once you test the code, record the results. When you have finished optimizing a function with VISTA, (1) write down a rough estimate of the total time you spent optimizing the function with VISTA (to the nearest 15 minutes or less), and (2) write down the VISTA optimization sequence that can be used to obtain your optimized code. Finally, record at what stage, if any, you altered the C code of the reference implementation. Valid answers are not at all, before VISTA was used, or after some VISTA optimization was attempted.

Record how much time you spent on the "free-style" optimization activity, as well as whether you used generated code as a starting point or "started from scratch". If you used VISTA, VPO, or TI generated code only for guidance this is to be recorded as "starting from scratch" or "None" in the free-style results table. If you modified a generated assembly file, record this as "VISTA", "VPO", or "TI" in the free-style results table.

Email your results at or before the time they are due. The address will be given to you by your TA. Enter in the subject line "[VISTA]", without the quotes and a space where appears. After the space you may put anything you want. Email your C code used in optimization (if it is different from the reference implementation), your free-style-optimized assembly code, and the results of your optimization process. Please attach the results as text; if you used a spreadsheet to record the results, export the file to a CSV file and attach the CSV file. Optionally, add any comments you might have about your experience with VISTA. Here are two tables listing what is required:

TABLE 1
VISTA Optimization	Optimized function
VISTA Optimization	iirfilt()
Iteration	Works?	Cycle time
1
2
3
...
Best VISTA	Yes
Best VISTA Optimization Sequence
Time spent using VISTA
When was C code altered?

TABLE 2
"Free-style" optimization	Optimized function
	iirfilt()
	Time Spent Optimizing	Cycle time
Best Free-style
Which generated code modified?

This lab is to be completed in two weeks, but the VISTA portion is to be completed in one week and the VISTA log should be sent in at that time. Grading for this lab will be a bit different from past labs:

1 point: Working VISTA version of iirfilt, due the first week.

2 points: Cycle count of the iirfilt version optimized with VISTA, due the first week.

1 point: Completing and emailing your code and "optimization logs", due the first week (VISTA C code and log) and second week (free-style code and log).

1 point: Working versions of the three free-style-optimized routines, due the second week.

4 points: Cycle counts of the three routines optimized in the free-style exercise, due the second week.

1 points: Quiz on VISTA during the second week.

Appendix A: Descriptions of some optimization phases

TABLE 3
Optimization Phase	Description
branch chaining	Replaces a branch or jump target with the target of the last jump in the chain
eliminate empty block	Removes empty blocks from the control flow graph
useless jump elimination	Remove useless transfers of control like a jump to the next block in the control flow
dead code elimination	Remove block unreachable from the top block
reverse branches	Reverses a conditional branch when it branches over a jump to eliminate the jump
block reordering	Removes a jump by reordering basic blocks when the target of the jump has only a single predecessor
merge basic blocks	Merges two consecutive basic blocks when the predecessor has no transfer of control and the successor has only one predecessor
instruction selection	Combine instructions together when the combined effect is in a legal instruction
evaluation order determination	Reorder instructions within a basic block to calculate expressions that require the most registers first
minimize loop jumps	Remove an unconditional jump at the end of a loop or one that jumps into a loop, by replicating a portion of the loop
dead assignment elimination	Removes assignments where the assignment value is never used
register allocation	Replaces references to a variable within a specific live range with a register
common subexpression elimination	Eliminates fully redundant calculations
loop transformations	Performs loop-invariant code motion, recurrence elimination, loop strength reduction, and induction variable elimination on each loop ordered by loop nesting level. Each of these transformations can also be individually selected by the user.
strength reduction	Replaces an expensive instruction with one or more cheaper ones

REFERENCES

Zephyr Investigators. Machine-independent optimization. [http://www.cs.virginia.edu/zephyr/vpo].

P. Kulkarni. (2003, August). Performance Driven Optimization Tuning in VISTA. [http://www.cs.fsu.edu/~whalley/papers/kulkarni_thesis03.ps]. Florida State University.

W. Zhao, B. Cai, D. Whalley, M. Bailey, R. van Engelen, X. Yuan, J. Hiser, J. Davidson, K. Gallivan, and D. Jones. (2002, June). VISTA: A System for Interactive Code Improvement. InProceedings of the ACM SIGPLAN Conference on Language, Compilers, and Tools for Embedded Systems. (pp. 155-164). [http://www.cs.fsu.edu/~whalley/papers/lctes02.ps].

M. R. Boyd; D. B. Whalley. (1995, June). Graphical Visualization of Compiler Optimizations. [http://www.cs.fsu.edu/~whalley/papers/jopl95.ps]. Journal of Programming Languages, 69-94.