Refactoring and Testing - Secrets of the Rewrite Tool, John Brant, Don Roberts, The Refactory Inc., www.refactory.com/Software, brant / roberts @refactory.com
The refactoring browser was first released in 1994 so is about to have its tenth birthday. The core of the RB is the rewrite framework and this is exposed to the user.
A refactoring is a transformation with several preconditions. This tutorial talks mainly about transformations as preconditions are fairly easy. (An example of a not-so-easy precondition is extracting an instance variable to something you reference. Proving this is a refactoring requires proving they are in a one-one correspondence, which is very hard. However, if you created the class in an immediately prior change then you know that the post-conditions of the class-creation refactoring satisfy these otherwise unprovable pre-conditions.) The refactoring framework supports
- custom refactorings: generic rewrites that the RB does not currently provide
- bulk transformations: your project needs to change a project-specific pattern to a new form
- changing layers: e.g. build a new DB layer, find and change 17,000 references to old layer
- migrations: e.g. synchrony's VSE to VA tool uses the rewrite framework.
Parsing is key. A method body is free-form text conforming to a grammar, a set of rules. Don showed these rules in BNF. Equally, these rules let you derive the derivation tree, with the method text being chuncked into the leaves of the tree, i.e. the terminals (selector symbol, varName, etc.).
John and Don simplified the derivation tree into a parse tree which, for example, has only a single class for method name, with unary, binary and selector just being a data state of the method name instead of further branches in the tree, and other terminals being instVar values of their owning non-terminal nodes.
John and Don had to think about comment placement and formatting which most parsers don't care about. In the latest version, wherever possible, they do replace-in-place to avoid reformatting the method.
The rewrite tool is a parse tree matcher whose language is a superset of smalltalk, which makes for both ease of use and ease of reuse. The rewriter can search and it can also replace (i.e. rewrite). One use of search is Smalllint (now renamed Code Critic). One use of rewrite is to do the refactoring browser's refactorings.
Pattern matching is accomplished by unification between the pattern and the code. The input is the tree to search and the pattern to find. The rewriter tries to find an assignment to the variables which matches and returns the found boolean plus this set of assignments if found. Matches can fail though incompleteness or through inconsistency, e.g. X := X pattern matches foo := foo but not foo foo (incomplete) or foo := bar (inconsistent). A rewrite then takes the matched variable assignments and outputs them in the requested rewritten form.
First easy error: the rewriter is not a textual search and replace tool, e.g a variable replace will not replace a method name call. (Use Vassili's regexpr utility if textual search and replace is what you want; sometimes, of course, a purely textual search and replace is the right solution.)
John showed it finding and rewriting smalltalk code, displaying the changes inspector. (This inspector gives you a moment to pause and reflect, letting you bail out if the rewrite has found much more or much other than you thought it would, or rewritten it quite unlike what you expected.) He then showed changing some of the smalltalk into pattern variables and rerunning the rewriter, displaying the proposed changes in the inspector.
- Simple variables: 'foo - matches any variable or pseudo-variable
- Simple literals: '#foo - matches any literal, nil
- List or subtree variables: '@foo matches any subtree, and can be thought of as 'match anything' (any var and any literal)
- recursive search - ''@foo will look within what it has found, e.g. ''receiver parent will match both cases in x parent parent, whereas without the second backquote only the outermost case would be rewritten. Rewrites that (could) match expressions with blocks in them often need recursive search.
To match an entire statement, use a period: '.statement will match a statement. Thus
'.duplicate.
'.duplicate
should match two identical consecutive statements that are the whole body of a sequence node. However as soon as we get beyond a within-statement expressions, we are matching sequence nodes. Matching two statements within a sequence node requires
'.@beforeStatements.
'.duplicate.
'.duplicate.
'.afterStatements
Because the . makes the tool build a sequence node, you must provide the 'zero or more statements before and after' code, unlike the expression case where it could match an expression within a longer expression. If the sequence node has temporaries, the above will not match. You must use
| '@temps |
'.@beforeStatements.
'.duplicate.
'.duplicate.
'.afterStatements
which will match two duplicate statements within any sequence of statements. It can be tricky to match sequence nodes; Don admitted he usually took two or three goes to get his expression right.
Q. Have that template appear when the tool comes up, as a method template appears in standard smalltalk? Some discussion, including of adding it to the Custom Refactoring project's right-click menu.)
They then showed renaming methods, noting that the change inspector only showed the new method as the old had been renamed.
Q. Anyone use this for code generation? Usually not so much for raw code generation as for bulk code alteration. Will Loew-Blosser's paper at OOPSLA 2002 described using it to replace one database layer with another. Will made 16,000+ changes using 200 rules that he scripted to change all DB access code over a weekend for a system that had to be up and running again on Monday (database access times were reduced from 40% to 2%). Aaron Hoffer made a similar number of changes at Washington Mutual, and nearly lost his job. The changes worked fine but his managers told him that 10,000 changes to the code base in a day rang every metrics-based alarm bell they had. Don wrote rewrite patterns to change a large system from old to new ANSI-standard exceptions; it took him half-an-hour including writing the patterns. You may want to match methods with a specific number of arguments, e.g.
'@receiver 'keyword1: '@arg1 'keyword2: '@arg2
or any number of args,
'@receiver '@keyword1: '@arg1
or a mix of specific and general args,
'@receiver at: '@arg1 '@keyword1: '@arg2
which will match any call of a two-parameter method whose first keyword is at:, noting that in e.g.
(conferences at: #SmalltalkSolutions put: self agenda)
at: #RBtutorial ifAbsentPut: #excellent
only the outer at-method would be matched because '@receiver does not request recursive matching; write ''@receiver to match both. Likewise ''@arg1, ''@arg2 would search for further matches in any matched args.
John remarked that he suspects an issue in the specific keyword number matching. I agreed; browsing the code I had found a case where I thought
only the first keyword would be checked. We have refactored this code in the Custom Refactoring project and believe our release fixes the issue.
Programmatic matching uses a curly-brace-delimited smalltalk block:
'{:node | block must return boolean: match expression?}
For example, to match any variable whose name begins with RB, use either
'{:node | node isVariable and: [RB* match: node name]}
or, using the standard meta-language to do the isVariable check,
'var '{ :node | 'RB*' match: node name}
(The Custom Refactoring project has added wildcard matching to expressions. If the Custom Refactoring code were loaded - and you were using underscore as the wildcard character - you could do the above with
'RB_
Obviously, block-matching remains very valuable in more complex cases.)
Another example: ANSI says that 3 - 4 should have a space before the four whereas VW does not require this. A rule to find all such cases is
('@a - '#literal) '{:node | node selectorParts first start = ('#literal start - 1)}
The parentheses above (i.e. the first pair of ordinary brackets) get parsed away but define an expression to which the block applies, just as, in the example further above, the name matching block applied to the preceding 'var expression. The expression matches the minus sign and so has to be a message node. John put a self halt in the above to show how they contrive to bind '#literal in the block to its value in the dictionary.
Search blocks can take two arguments: the second is a dictionary, which communicates between search and replace. You can reference the matched pattern variables (e.g. 'var) and/or just stuff something (#foo in the example below) into it. For example, to add the letter a to
SEARCH
'var '{ :node :dict | dict at: #foo put: 'a' , node name}
REPLACE
'{:dict | RBVariableNode named: (dict at: #foo)}
RBVariableNode appears because, to replace, you must return nodes, not just text from blocks. A replace block must return a node just as a search block must return a boolean.
The rewrite tool can be used to restrict the scope of a rewrite to just those packages, classes and methods you select (in VW7; in earlier / other
dialects, select applications / categories and classes, and, with Custom Refactoring code loaded, protocols and methods).
Any rules you add to the Smallint classes are automatically integrated into the Smalllint tool. Look at ParseTreeLintRule on the class side for examples of those rules that are based solely on parsing. Any method that appears in the five rule categories on the class side of ParseTreeLintRule will appear in the Smalllint tool (trivial hack on superclass' method will let you add a sixth category that will also be recognised). John demonstrated by adding a rule to turn isNil ifTrue: to ifNil:.
ParseTreeTransformationRule works much the same. Each item within these classes is an array in which the first value is the pattern for what to find and the second is the pattern for how to replace it. You can add to these lists or add your own rules.
Q. How find an expression that accepts one of four methods in given place? Use a block or write four rules (or load the Custom Refactoring project and use an OR separator - available from the right-click menu - between each of the four matching expressions).
Change objects: when you change the system you should have an object that represents that change. That's what the rewrite framework does. All subclasses of RefactoringChange know how to execute themselves. They also all implement asUndoOperation which returns the inverse change. They also effect the transactional behaviour. These changes are added to a CompositeRefactoryChange that is only executed after all is prepared.
To demo this, they wrote a script to create a CompositeRefactoryChange. If the changes are related, e.g. because performing them all will move the system from one consistent state to another, then a composite change is better than doing each change individually because the latter means you must undo them all individually if necessary.
changes := CompositeRefactoryChange named: 'testing'.
changes defineClass: '.... class def text'
changes defineClass: ....
....
Don explained that they have the RB's model, containing RBClass objects, and real classes are created after you execute the changes (alter RB model); and then effect the model into the image. Because the rewriter has this transactional model, it can (almost always) rewrite itself. The only code they cannot safely refactor is some code in these change classes. (They wish all of VW would use change objects; accept something in the VW debugger and suddenly the RB's undo goes grey since you can no longer trust that your undos are safe.)
They then wrote a script creating a composite change that renamed all their classes from RB* to BR*. John then called primitiveExecute on the composite instead of execute as the latter calls primitiveExecute and then adds the change to the RefactoringManager for effecting into the actual code. He then inspected the changes. Later, he did them and went on using the tool, as an example of the Refactoring Browser refactoring itself.
Lastly, John discussed using the tool in migration. The model helps this by using SmaCC (their generic parser utility) to either change and then export code without first recompiling it, or to import and change code before recompiling it. SmaCC is three times slower (when just parsing smalltalk) than the RB parser. SmaCC can do error handling much better than the RB. They are adding a new refactoring: Extract Block.
Q. Binding References? Matched as literals.