Experienced smalltalkers may skip this document.
This document does not deal with possible conflicts between performance and elegance/reusability; be aware that in some cases a compromise has to be made. Below, we try to quantify performance aspects only, to give you the background information required for coding decisions.
Algorithms in general
Most general programming concepts which affect a program's performance also apply to Smalltalk code; typical algorithmic errors or performance eaters will hurt a smalltalk program just like any other.
Some general rules (which apply for any programming language):
Before worrying about any specific (microscopic) performance tuning, always make certain that your general concept is correct and the best algorithms are used. For example, in most situations, a bubbleSort in assembler is slower than a quickSort written in a slow interpreted language, if the number of objects exceeds a certain limit. On the other hand, if only a few elements are to be sorted (say 5), even a naive sort algorithm will outperform a quickSort.
These algorithmic effects (and also the choice of collection, as seen below) usually have a much higher impact on performance than microscopic fine tuning.
Use existing methods
Before writing or rewriting any code, check whether similar or fitting functionality is already present in the system. Beginners tend to rewrite things which already exist somewhere; especially collection methods tend to be overlooked, and their functionality is reimplemented - often as suboptimal code.
For example, if you have to fill an area of a collection, you may tend to write:

index1 to:index2 do:[:idx |
    someCollection at:idx put:someValue
].

and ignore the fact that a specialized method for doing this already exists. So, the above is better written as:

someCollection from:index1 to:index2 put:someValue.

This not only saves code, but is also faster by roughly a factor of 4 (since #from:to:put: is implemented as a highly optimized primitive method).
This applies mostly to collections and streams.
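Another example of the same pattern: to extract part of a string, there is no need for an explicit copy loop - a bulk copy method already exists (a small sketch):

|word|
word := 'hello world' copyFrom:7 to:11.
word     "-> 'world'"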
How to find performance eaters / using MessageTally
The general rule that 80% of a program's execution time is spent in 20% of the code also applies to most smalltalk programs.
Before starting to scan all of your code and tuning every method,
you should first try to find the trouble spots and tune those.
Get a list of methods in which most of the execution time is spent,
and concentrate on the top items first.
To get this list, execute your program (or parts of it) under a MessageTally. It will give you a list which summarizes the execution times sorted by method. Both the accumulated time (i.e. the time spent in a method and all of its callees) and the leaf time (i.e. the time spent in a particular method alone) are reported.
Since the data is acquired by probing the execution state of the program at regular time intervals, the information may not be exact, and in some cases may be completely wrong (for example, if the program executes in cycles which are in sync with the probe intervals). This is of course a general problem, and always occurs when sampling with a (relatively) low frequency. Despite these shortcomings, the information is generally quite valuable to start with.
The default sampling occurs in 10ms ticks, but the MessageTally class offers an interface where the sampling interval can be specified as an argument.
For reasonable statistics, the program should at least execute for a second;
the longer, the better.
For example, the output of:

|data rnd|
data := Array new:10000.
rnd := Random new.
1 to:10000 do:[:i |
    data at:i put:(rnd nextIntegerBetween:1 and:100000)
].
MessageTally spyOn:[
    10 timesRepeat:[data sort].
]

is:

total execution time: 2189 ms (108 probes ; error >= 0.9%)

MessageTally>>MessageTally execute (total 100.0%)
 Array>>SequenceableCollection sort (total 100.0%)
  Array>>SequenceableCollection quickSortFrom:to: (total 100.0%)(leaf 8.3%)
   Array>>SequenceableCollection quickSortFrom:to: (total 91.7%)(leaf 9.3%)
    Array>>SequenceableCollection quickSortFrom:to: (total 82.4%)(leaf 1.9%)
     Array>>SequenceableCollection quickSortFrom:to: (total 80.6%)(leaf 0.9%)
      Array>>SequenceableCollection quickSortFrom:to: (total 79.6%)(leaf 8.3%)
       Array>>SequenceableCollection quickSortFrom:to: (total 71.3%)(leaf 8.3%)
        Array>>SequenceableCollection quickSortFrom:to: (total 63.0%)(leaf 4.6%)
         Array>>SequenceableCollection quickSortFrom:to: (total 58.3%)(leaf 9.3%)
          Array>>SequenceableCollection quickSortFrom:to: (total 49.1%)(leaf 17.6%)
           Array>>SequenceableCollection quickSortFrom:to: (total 31.5%)(leaf 7.4%)
            Array>>SequenceableCollection quickSortFrom:to: (total 24.1%)(leaf 11.1%)
             Array>>SequenceableCollection quickSortFrom:to: (total 13.0%)(leaf 5.6%)
              Array>>SequenceableCollection quickSortFrom:to: (total 7.4%)(leaf 1.9%)
               Array>>SequenceableCollection quickSortFrom:to: (total 5.6%)(leaf 0.9%)
                Array>>SequenceableCollection quickSortFrom:to: (total 4.6%)(leaf 1.9%)
                 Array>>SequenceableCollection quickSortFrom:to: (total 2.8%)
                  Array>>SequenceableCollection quickSortFrom:to: (total 2.8%)(leaf 0.9%)
                   Array>>SequenceableCollection quickSortFrom:to: (total 1.9%)
                    Array>>SequenceableCollection quickSortFrom:to: (total 1.9%)
                     Array>>SequenceableCollection quickSortFrom:to: (total 1.9%)(leaf 0.9%)
                      Array>>SequenceableCollection quickSortFrom:to: (total 0.9%)
                       Array>>SequenceableCollection quickSortFrom:to: (total 0.9%)
                        Array>>SequenceableCollection quickSortFrom:to: (total 0.9%)(leaf 0.9%)

leafs of calling tree:
Array>>SequenceableCollection quickSortFrom:to: (total 100.0%)(leaf 100.0%)

Of course, this is what we expected; all the time is spent in the quickSort method (which recursively calls itself).
The execution time as reported is not the same as when the program is executed without a MessageTally; the MessageTally itself introduces some overhead, slowing down the execution (during its run, it builds a call tree which may become quite large). In most situations, this is not a problem - the relative numbers as reported are independent of the absolute execution time. (Exception: if you want to measure the execution time of a program which depends on real time.)
The number of probes as reported gives a hint on how exact the statistical data is: with more probes, the measured numbers become more exact. If the number of probes is too small, you should execute the program in a loop, or use a MessageTally with a smaller probe tick-time.
In the following, the execution time is too short for a reasonable measurement, because only a few probes occur:

|data rnd|
data := Array new:2000.
rnd := Random new.
1 to:2000 do:[:i |
    data at:i put:(rnd nextIntegerBetween:1 and:100000)
].
MessageTally spyOn:[
    data sort.
]
Compare the above to the following, which samples in 1ms intervals:

|data rnd|
data := Array new:2000.
rnd := Random new.
1 to:2000 do:[:i |
    data at:i put:(rnd nextIntegerBetween:1 and:100000)
].
MessageTally spyDetailedOn:[
    data sort.
]
Notice that not all operating systems support timer intervals this short; on some, the resolution may be no better than 20ms.
If you cannot run your code under a MessageTally (for example, because it is an interrupt handler, or is synchronized by the timer), try to find the execution times by placing parts of your code into timing blocks, as in:

...
t1 := Time millisecondsToRun:[
    ... some code here ...
].
...
t2 := Time millisecondsToRun:[
    ... more code here ...
].
...
'time spent in first block: ' print. t1 printNL.
'time spent in second block: ' print. t2 printNL.

If it is not possible to print the results (because the printing itself disturbs the operation & timing), keep the time values in some collection which is printed later, at uncritical times.
Use of collections
Choose an appropriate class
Although it is very convenient to have a consistent protocol within the collection hierarchy, the cost of a particular operation is hidden somewhere in the implementation and may not be obvious at first sight. For example, the time required to add an element to a collection varies with the type of collection involved ("anArray add:" is much more expensive than "anOrderedCollection add:"). It may be hard to "see" this cost when reading code which gets some collection as an argument and operates on it.
Check out the use of collections and consider replacing them by others which offer better performance. There is no ``general perfect collection'' which satisfies all needs; performance depends heavily on the way a collection is used.
You should at least gather a rough idea of how collections are implemented internally, to get a feeling for which operations are to be considered ``dangerous'' with respect to performance. This may be somewhat in conflict with the general ``black box'' or ``need to know'' approach, but experience shows that most mistakes are due to misuse rather than proper reuse. After all, that's what the browser and the full source code are provided for with the system.
One typical situation is when elements are searched for in a collection. For many collections, search time grows linearly with the number of elements (Array, OrderedCollection, LinkedList etc.). If the collection in question is large, replace arrays or orderedCollections by sortedCollections if the elements can be compared (log-n search time), or by sets/dictionaries (almost constant search time).
Of course, there is the usual time/space tradeoff: hashed collections require some additional space (typically 20-30%), binary trees require 2 additional reference entries per element, etc.
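The effect is easy to measure; a small sketch (the absolute numbers will of course vary with the machine):

|coll set|
coll := (1 to:100000) asOrderedCollection.
set := coll asSet.
Time millisecondsToRun:[1000 timesRepeat:[coll includes:99999]].     "linear search"
Time millisecondsToRun:[1000 timesRepeat:[set includes:99999]]       "hashed; almost constant"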
Be aware that some collections are slower than others when adding elements (SortedCollection vs. OrderedCollection), and that the time may depend on whether the elements arrive in ascending or in random order (SortedCollection).
In some situations, it makes sense to build up
a collection as an OrderedCollection, Set or Dictionary,
and convert it as appropriate when all elements have been added,
for shorter search times later.
If many elements are to be added to a SortedCollection, it may be preferable to add them unsorted (i.e. build an OrderedCollection) and sort the elements once at the end, instead of letting the collection sort each new element in individually.
For example:

|randomValues rnd s|
rnd := Random new.
randomValues := (1 to:10000) collect:[:i | rnd next].
s := SortedCollection new.
Time millisecondsToRun:[
    randomValues do:[:v | s add:v]
]

is about 4 times slower than:

...
Time millisecondsToRun:[
    s addAll:randomValues
]

or:

...
Time millisecondsToRun:[
    s := randomValues asSortedCollection
]

which uses the same algorithm internally.
In contrast, if we add elements in presorted order, as in:

|presortedValues s|
presortedValues := (1 to:10000).
s := SortedCollection new.
Time millisecondsToRun:[
    presortedValues do:[:v | s add:v]
]

the difference drops to roughly a factor of 1.6 (but adding them all at once is still faster).
Let's analyze and measure a concrete example with searching; the following (naive) code reads a file and builds up a list of unique words:

|data stream line words|
data := OrderedCollection new.
stream := 'Makefile' asFilename readStream.
[stream atEnd] whileFalse:[
    line := stream nextLine.
    words := line asCollectionOfWords.
    words do:[:word |
        (data includes:word) ifFalse:[
            data add:word
        ]
    ]
].
stream close

Executing this shows very poor performance, and you may ask yourself: why?
At first, you may think that the splitting of lines into individual words produces too much overhead, or that temporary storage allocation and reclamation is the problem. Before starting to replace #asCollectionOfWords by an inline scanning loop, either think about what may happen, or use a MessageTally to see what really happens: almost all of the time is spent searching the growing word list (in "OrderedCollection>>includes:").
We used an inappropriate collection for this type of search, leading to quadratic runtime behavior. Such an algorithm's execution time can grow quite rapidly; in our case (with roughly 2200 unique words out of a total of 8600 words), this accounts for millions of string compare operations.
Of course, we would like our algorithm to show an execution time which grows linearly with the size of the input, (possibly) unaffected by the number of unique words. Replacing the OrderedCollection by a Set accomplishes this, and speeds up the program by a factor of about 10:

|data stream line words|
data := Set new.
stream := 'Makefile' asFilename readStream.
[stream atEnd] whileFalse:[
    line := stream nextLine.
    words := line asCollectionOfWords.
    words do:[:word |
        (data includes:word) ifFalse:[
            data add:word
        ]
    ]
].
stream close. data
Avoid growing collections by copying
Don't build up a collection element by element via copying. For example:

|list|
list := Array new.
1 to:10000 do:[:i |
    list := list copyWith:i
]

creates 9999 intermediate throwaway arrays of increasing size (0 to 9999). All those temporaries account for a rough total of 200Mb of garbage, which produces quite a bit of overhead in creation, initialization (nilling) and reclamation.
The same argument is true for the , (comma) message, which also creates new collections.
I.e.:

|string|
string := ''.
1 to:10000 do:[:i |
    string := string , (i printString)
]

allocates huge amounts of temporary strings (which are immediately thrown away).
To avoid this, use a stream to collect the data:

|stream string|
stream := WriteStream on:''.
1 to:10000 do:[:i |
    stream nextPutAll:(i printString).
].
string := stream contents.

This is much faster, because the stream uses a buffer which is doubled in size when full, thus performing only log2(N) allocations for a final string of size N.
Preallocate collections
If you know in advance how big the collection is going to be, preallocate a big one and store the elements into it:

|list|
list := Array new:10000.
1 to:10000 do:[:i |
    list at:i put:i
]

This takes only 90ms on the same machine (with interpreted code), and drops to 4ms if compiled to machine code.
You may prefer to write the ``smalltalker's'' code:

list := (1 to:10000) asArray

which does exactly the same as above (but is considered better coding style).
Convert after building
If you do not know the final size in advance, either use a collection which is prepared to grow and convert it when done:

|list temp|
temp := OrderedCollection new.
1 to:10000 do:[:i |
    temp add:i
].
list := temp asArray

or use a writeStream, which is also tuned for growing its underlying collection:

|list stream|
stream := WriteStream on:(Array new).
1 to:10000 do:[:i |
    stream nextPut:i
].
list := stream contents

On the same hardware, the last two versions both execute in about 110ms if interpreted, or about 35ms if compiled.
All of the above is also true for the #',' (comma) concatenation message, which is often used with string objects. Especially when building strings, a writeStream is generally the better choice.
Avoid removing inner elements from OrderedCollections
OrderedCollections, which keep their elements in a dense internal array, typically show poor performance when the collection is large and elements are removed in the middle (i.e. not at either end). The reason is that to remove an inner element, all following elements in the underlying container have to be copied towards the remove index. In contrast, removing at either end is much cheaper, since only the startIndex or lastIndex instance variable has to be modified and no block copying is needed.
If you have to remove many elements from a huge orderedCollection, it may be faster (depending on the size) to create a new collection from scratch - even though this may involve some garbage collector overhead.
For example, to remove all even elements from a 10000 element collection, the following code:

coll removeAllSuchThat:[:el | el even].

takes around one second on a P5/133.
If we do not trust #removeAllSuchThat: and do things ourselves, as in:

coll size to:1 by:-1 do:[:index |
    (coll at:index) even ifTrue:[
        coll removeAtIndex:index
    ]
]

things become a bit faster (around 530ms). (Notice that we walk backwards - both to not confuse our indices when removing AND for a slight speed advantage, since less copying is required. This is also faster because #removeAllSuchThat: returns a new collection of the removed items, which we do not need here.)
The fastest version here creates a fresh collection, as in:

coll := coll select:[:el | el odd].

which executes in about 64ms on the same machine.
The same is true for SortedCollection, which is based upon OrderedCollection.
Notice that in ST/X, the Set and Dictionary classes use a special removal algorithm, so they do NOT suffer from the above problem. Other smalltalk systems may show poor performance with Sets and Dictionaries as well, due to excessive rehashing when elements are removed.
Avoid LinkedLists
Although many algorithm textbooks spend much time describing them, linkedLists are among the slowest collections and should not be used. In practice, there are hardly any situations in which they perform better than orderedCollections, queues or fixed arrays.
If you restrict yourself to adding/removing individual elements at either end, linkedLists may be sufficient (but please check; usually orderedCollections show better performance and require less memory overhead, even for the above mentioned operations).
In any case, avoid indexed accesses on a linkedList. For protocol completeness, indexed access is provided by the #at: method, but it is internally implemented by walking from the list's anchor up to the specified element. Some methods inherited by LinkedList walk over the receiver doing indexed accesses - these show quadratic runtime behavior when applied to linkedLists. The worst performance is encountered when walking over a linked list in reverse order - this is terribly slow.
Of course, it would be easy to rewrite these methods to avoid this behavior. However, this was not done, since LinkedLists are rarely used in practice.
(If you insist on using linkedLists and you need those methods, please define a private subclass of LinkedList and add tuned implementations of those algorithms to that class; top candidates are #collect:, #select: and all other methods which are inherited from SequenceableCollection and walk over the receiver's elements using #at: to fetch them.)
Also notice that linkedLists require their elements to inherit from Link, or to provide a similar protocol. Thus, linkedLists are limited in their element types, while other collections work with any object.
Finally, linkedLists are dangerous in another area: an element can only be on one linkedList at any time. If you add an element to another list, the original list will be corrupted unless the element was removed from it before. For this reason, we highly recommend using ValueLink elements instead of placing your objects directly into a linkedList.
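For example (a sketch, assuming the usual ValueLink protocol with #value: / #value):

|list|
list := LinkedList new.
list add:(ValueLink new value:'one').
list add:(ValueLink new value:'two').
list first value     "-> 'one'; the string itself may safely appear on other lists too"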
Arrays
Fixed-size arrays show the best access performance, but cannot grow and are therefore harder to create if the final size is not known in advance. If your program has to build up some collection which stays fixed in size afterwards (but it's not known in advance how big it is going to be), build it as an OrderedCollection and convert to an Array when done (if this slight speed advantage is really required). Alternatively, use a WriteStream and take its contents at the end.
In most smalltalk implementations, arrays cannot grow in size (i.e. "someArray add:anElement" is not possible). In ST/X, arrays support growing for protocol completeness, but since this operation is a very slow one, a warning message is sent to the terminal.
As an example,

|a|
a := Array new.
1 to:100000 do:[:i |
    a add:i
]

may take forever, due to the very expensive resizing in #add:. (The corresponding code using an orderedCollection is more than 1000 times faster.)
Notice: the real reason for Array>>add: being so slow is that the underlying basic mechanism (#become:) is slow in many smalltalk implementations. Read more on this below.
Use Sets and Dictionaries
Most beginners tend to ignore sets and dictionaries. Instead, they often use arrays and implement all kinds of loops, processing them in the application's methods (especially long-time C users tend to do this ;-).
In general, this will require more code and be slower than code using a dictionary (except for a very small number of elements, or if a numeric index can be computed).
Sets are perfect for inclusion tests or to find unique values. Always consider using a set if you have to search the collection using #includes:.
Bags are great for counting elements.
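For example (a small sketch):

|bag|
bag := Bag new.
#(the quick the lazy the) do:[:word | bag add:word].
bag occurrencesOf:#the     "-> 3"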
If your access keys are integers or symbols, use identitySets or identityDictionaries. These are slightly faster in most smalltalk implementations.
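For example, with symbol keys the cheaper identity hash and identity compare are sufficient (a sketch):

|d|
d := IdentityDictionary new.
d at:#width put:100.
d at:#height put:50.
d at:#width     "-> 100; keys are compared with #== instead of #="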
For sparse arrays, a dictionary will do (there is also a class found in the
goody-directory which implements sparseArrays on top of a dictionary).
Comparing collections
Be careful when comparing collections using #=; the compare method will recursively walk down the receiver and the argument and compare elements (which may themselves be collections). Although this is a useful and powerful operation, it may be quite slow for big collections (think of huge trees, or arrays of arrays etc.).
If you have to compare collections (searching game trees, pattern matching etc.), you can avoid useless compares by adding some hashKey or signature to your collections and checking these first for a fast trivial reject.
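A sketch of the idea; the signature instance variable (kept up to date whenever the collection is modified) and the tree class itself are hypothetical:

= anObject
    "fast trivial reject: trees with different signatures can never be equal"
    signature ~~ anObject signature ifTrue:[^ false].
    "signatures match - now do the expensive recursive element compare"
    ^ super = anObject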
Since sets and dictionaries limit the number of compares when searching for entries, there is no specific performance penalty when this kind of object is stored into them. However, make certain that your elements provide a good hash value (some simply inherit the #hash method from Object, which is only good for hashing on identity).
If you have a collection of mixed type objects, and want to search for some specific element, think about this comparison cost. It may be better to use separate containers or somehow tag elements.
When redefining the hash method, do not make its algorithm too expensive (sets and dictionaries ask for the hash value with every add:, remove:, includes: etc.). You may have to weigh the comparison time against the time it takes to generate the hash value, and make a ``good'' compromise.
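For example, a class with many instance variables might hash on only one or two significant ones; a sketch (name and id are hypothetical instance variables):

hash
    "cheap to compute, yet good enough to spread instances over the hash table"
    ^ name hash bitXor:id hash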
Bulk byte valued data
The following applies to ST/X only:
If you need big byteArrays (for example, in image processing), you often have to allocate huge byteArrays which are initialized with pixel data soon after allocation. In all smalltalk memory systems, a major fraction of an object's creation cost is the time required to initialize the new object (for normal objects, the entries are nilled; for byteArrays, they are set to zero). Of course, for normal objects the nilling is required (otherwise, you could get invalid object references), but byteArrays can be created without caring for the initial values.
This is done by:

...
data := ByteArray uninitializedNew:size
...

which is similar to the regular #basicNew:, but does not set the elements to zero (instead, you will find random values in the byteArray). This is considerably faster; the overall cost to allocate and (automatically) reclaim a 1000 element byteArray is roughly 4 microseconds vs. 28 microseconds.
BTW: for huge truth-valued collections, it may make sense to define a BooleanArray as a variableByteSubclass, and redefine the #at: and #at:put: methods to return boolean values instead of numeric ones. This can save considerable memory (a factor of four if you store one boolean per byte, or a factor of 32 if you pack 8 booleans into one byte).
You will find such a BooleanArray implementation in the goody directory.
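A minimal sketch of the unpacked (one boolean per byte) variant; the packed 8-per-byte variant additionally needs the obvious bit masking and shifting:

!BooleanArray methodsFor:'accessing'!

at:index
    "retrieve a boolean; the underlying byte holds 0 or 1"
    ^ (super at:index) ~~ 0
!

at:index put:aBoolean
    "store a boolean as byte 0 or 1"
    super at:index put:(aBoolean ifTrue:[1] ifFalse:[0]).
    ^ aBoolean
! !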
Use of number classes
As with collections, some numeric operations are much slower depending on which number representation is involved, and dramatic performance degradation can result if inappropriate types are used or excessive conversions are performed.
Avoid fractions
Fractions keep the numerator and denominator internally. What makes them slow in some situations is that they always reduce themselves when created as the result of a numeric operation. This reduction is done by computing the greatest common divisor of the numerator and denominator, which can take some time; especially if either of them is a LargeInteger. It may be better to convert numbers to floats if this happens to be a problem.
Since fractions are relatively seldom used in practice, no specific tuning has been done to fix this. If there is a need, future releases may include a modified fraction class which performs the above lazily (i.e. keeps the values unreduced as long as possible).
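To see the effect, compare (a sketch; absolute numbers vary):

|f1 f2 d1 d2|
f1 := 1/3.  f2 := 1/4.
d1 := f1 asFloat.  d2 := f2 asFloat.
Time millisecondsToRun:[100000 timesRepeat:[f1 + f2]].     "gcd reduction per add"
Time millisecondsToRun:[100000 timesRepeat:[d1 + d2]]      "plain float add"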
BTW: in ST/X (and possibly other smalltalk implementations as well), the LargeInteger class is provided mainly for completeness; it has not been written or tuned with speed as a primary goal.
Floating point arithmetic
Floating point arithmetic creates new Float instances for every
result. Therefore, the use of floating point numbers creates additional memory
management overhead (this is also true for largeIntegers and fractions).
For example, the execution time of the following is a consequence of memory management overhead, not of floating point arithmetic:

|sum|
sum := 1.0.
1 to:1000000 do:[:i |
    sum := sum + 1
]

The above creates a million new floating point objects (which are, except for the last one, discarded and reclaimed by the garbage collector). All in all, it creates and reclaims about 20Mb of garbage.
The cost of allocating and collecting a float is in the order of 1-2 microseconds.
SmallIntegers do not suffer from this problem.
Bulk numeric data
If your application requires big collections of floats, use instances of FloatArray or DoubleArray to keep them. They store the values directly (instead of a reference to a number object) and thereby reduce storage requirements and garbage collector overhead. A 1000000-element FloatArray requires roughly 4Mb of memory, while a regular 1000000-element Array of floats requires roughly 24Mb, since each float is represented by an individual object.
FloatArray and DoubleArray may also be subclassed (even with additional named instance variables) for special requirements. Typical applications are matrix or vector classes. Add primitive code for heavy duty operations (like matrix inversion, matrix multiplication etc.); these primitives can be written in a way that avoids any temporary garbage, which is not possible if regular floats are used.
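For example, a simple vector dot product over FloatArrays (a sketch without primitive code, assuming the generic #withAll: creation message; note that the loop itself still creates boxed intermediate floats - a primitive-coded version would avoid even those):

|a b dot|
a := FloatArray withAll:#(1.0 2.0 3.0).
b := FloatArray withAll:#(4.0 5.0 6.0).
dot := 0.0.
1 to:a size do:[:i |
    dot := dot + ((a at:i) * (b at:i))
].
dot     "-> 32.0"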
Weakarrays produce additional GC overhead
In the current (ST/X) implementation of the garbage collector, the processing of weakArrays creates additional overhead, which is due to the scanning of all objects containing weak pointers after every scavenge and sweep operation. Measurements show an overhead of roughly 3-5 microseconds per weak object (in other words: if an empty scavenge takes 500us without any weak objects, expect some 3-5ms scavenge time if there are 1000 weak arrays in the system).
Try to avoid creating many weakArrays. One reason for many weakArrays being created is the default dependency mechanism (i.e. for non-Models), which creates a small weakArray for every object that has at least one dependent.
This is bad news, since using weakArrays for the dependency mechanism makes much sense - it allows you to forget about releasing dependencies. (Other systems use regular arrays for dependencies, and you have to release these manually in order to allow the garbage collector to free objects. Without a release, objects are kept from being freed by having a dependent.)
In practice, weakArray overhead is less of a problem,
since most dependencies are for models or objects derived from the Model class.
These have a redefined dependency mechanism, which is not based on weakArrays.
Future implementations may be rewritten, to reduce or even avoid this
additional overhead.
Micro tuning
Some of this section is meant for advanced smalltalk programmers.
Beginners should not concentrate on all those details.
There are certain smalltalk specific aspects, which may be considered when tuning code. The following describes various microscopic effects. Do not expect as dramatic speedups as in the above examples. However, these microscopic effects may accumulate to a macroscopic effect, especially when present in loops which are executed many times.
Often, there is a conflict between the desire to write
flexible, reusable and elegant code on one side, and the performance
aspect on the other. We will not discuss this in the following, but instead
simply list (and quantify) things.
Using isKindOf: is bad style, and not even faster
As described in various other documents, there are many reasons for banning the use of the #isKindOf: message. However, many (beginners) think that it at least makes sense from a performance standpoint; this is not true.
The #isKindOf: method has to get the receiver's class and follow that class's superclass chain all the way up to the top in the false case, and up to the checked class in the true case. In any case, this is one message send PLUS the superclass chain walk. For average superclass chains, #isKindOf: executes in the order of a microsecond (on my machine), but it can take longer for deeply nested class hierarchies.
It is always better to write a check method which returns false in the default case and true for instances of your desired class.
I.e., instead of writing "foo isKindOf:MyClass", you should implement two methods;
in Object:

isMyClass
    ^ false

or:

respondsToWhatINeed
    ^ false

and in your class:

isMyClass
    ^ true

or:

respondsToWhatINeed
    ^ true

and write "foo isMyClass" or "foo respondsToWhatINeed".
This will always execute in a time shorter than or equal to the #isKindOf: time (a single message send is required, which takes some 300ns on the same machine) and is more flexible, since you may implement the check method in other classes, without a need to stay within a particular subclass hierarchy.
If you look into the Object class's protocol, you will already find a bunch of these check methods - #respondsToArithmetic, #isStream, #isString and so on.
Measurements to prove the above (on SGI-Indy):
'hello' isKindOf:Number 1120ns (has to walk up for a miss)
1 isKindOf:Number 785ns (has to walk up 2 classes for a hit)
1.0 isKindOf:Number 787ns (has to walk up 2 classes for a hit)
'hello' isNumber 305ns (immediate return false in Object)
1 isNumber 305ns (immediate return true in Number)
Instvar access vs. self sends
From the programmer's point of view, it is better to access instance variables within a method via self sends than to access them directly. This gives more flexibility when the class is eventually subclassed and the access method redefined to return a computed or modified value.
Performance-wise, it is better to access instance variables directly. The cost of an instance variable access is in the order of 20-70ns, while a message send returning an instance variable takes some 200-500ns.
Example:

Object subclass:MyClass
    ...
    instanceVariableNames:'foo bar'
    ...

!MyClass methodsFor:'accessing'!

foo
    ^ foo
!

...
example
    ...
    y := self foo.
    ...
!
...

is slightly slower than:

...
example
    ...
    y := foo.
    ...
!
...

but of course, less flexible.
Despite the fact that direct access is faster, you should not ignore the maintenance and reusability issues. Quite a few methods had to be rewritten in the past to use self sends instead of direct instVar accesses (to allow things to be redefined in added subclasses).
Care for this "optimization" only after all other tuning fails. Never "optimize" classes this way which are intended as abstract or framework classes. Only consider it in places which are executed millions of times in a loop (better: move the self send out of the loop).
Variable access
In nested blocks, variables of the inner scopes are accessed faster than outer variables. For example,

|tmp|
1000 timesRepeat:[
    ...
    tmp := ....
    ...
]

is slightly slower than:

1000 timesRepeat:[
    |tmp|
    ...
    tmp := ....
    ...
]

Also, local variable access is faster than global, class variable or instance variable access.
However, the performance difference is not a dramatic one; it is in the order of 20-70 nanoseconds per access (due to an added indirection in the memory access).
Beside the slight speed advantage, block locals also remove the danger of side effects between temporary variables within a block, so their use can be considered "good style". (This may not be true for some (older?) versions of Smalltalk/V, especially in systems where block locals are technically implemented as method locals and are shared between the method and its blocks.)
Blocks
The creation of blocks may be an expensive operation. Especially if the block references locals from its home context (so-called ``full blocks'') AND survives its creating method context, additional CPU overhead is produced, since a full context must be instantiated (blocks are really closures).
ST/X performs context allocation lazily and distinguishes 3 kinds of blocks:
Blocks which only reference globals, class variables, block arguments or inner block locals are created once at compilation time and reused for every invocation (so-called ``cheap blocks''). These do not refer to the home context.
Currently, ST/X treats blocks which access read-only local variables (so-called ``copying blocks'') just like full blocks. Later versions will handle these cases as well, and their cost will then be somewhere in between cheap blocks and full blocks.
Try to avoid full blocks, if possible. For example:

|i|
i := 1.
someArray do:[:element |
    ... do something with element and i ...
    i := i + 1.
]

creates a full block, since the outer variable i is accessed (and even modified) within the block. The alternative:

someArray keysAndValuesDo:[:i :element |
    ... do something with element and i ...
]

is shorter and faster in execution, since the compiler can create a cheap or copying block in many situations.
The following applies to full and copying blocks only. All enumeration messages expect a block argument; if loops are nested, try to have the inner loop(s) be executed more often than the outer ones. For example:

1 to:10 do:[:i |
    1 to:1000 do:[:j |
        ...
    ]
]

creates 10 garbage blocks (*), while:

1 to:1000 do:[:j |
    1 to:10 do:[:i |
        ...
    ]
]

creates 1000 garbage blocks when executed (*).
Note: (*)
Actually, in the above concrete example, this is not really true, since the stc-compiler inlines to:do: messages iff its type analyzer can find out that the loop bounds are both integers. Inlined loops do not create any garbage blocks at all.
Block creation cost is in the order of microseconds for full blocks. Block invocation time is between 150ns (R4000/P5) and 500ns (486/50). The cost of instantiating surviving contexts (i.e. an accessible block's home) is also in the order of microseconds and also creates additional GC overhead.
Common subexpressions
In smalltalk, the compiler can generally not merge common subexpressions, as is done by compilers for other programming languages. The reason is that, in general, the compiler does not know about the receiver of a message and therefore does not know what side effects are introduced by some expression (even if it did, it could not make use of that knowledge in many situations, since the receiver could be changed by some other thread or an interrupt handler). This may change in future systems, if Self-style dynamic recompilation is implemented; however, no available smalltalk system seems to perform these optimizations today.
Consider the following code fragment:

...
i := 1.
[i < aCollection size] whileTrue:[
    ...
    do something with aCollection
    ...
]

Here, "aCollection size" is evaluated over and over, since the compiler cannot know (in general) whether the collection's size changes within the loop (actually, it does not even know about the meaning of the size message; it does not even know that aCollection is some kind of collection). For most collections, the size message is pretty cheap, but an additional message send (which is technically a function call) is performed for every loop iteration.
If you know that the collection's size is invariant, rewrite the above to:

|sz|
...
i := 1.
sz := aCollection size.
[i < sz] whileTrue:[
    ...
    do something with aCollection
    ...
]
...

or (better):

...
1 to:aCollection size do:[:i |
    ...
    do something with aCollection
    ...
]
...

Of course, all of the above can be considered ``bad smalltalk style''. A more general solution (if you don't modify the collection) is to write:

...
aCollection do:[:element |
    ...
    do something with element
    ...
]
...

since it can operate on any type of collection, while the above examples only work with sequenceable collections (those that use a numeric index).
The same is true for arithmetic expressions, indexed collection accesses etc. I.e. instead of:

...
... aCollection at:index+1 ...
...
... aCollection at:index+1 ...
...

use:

|nextIndex|
...
nextIndex := index + 1.
...
... aCollection at:nextIndex ...
...
... aCollection at:nextIndex ...
...

or (if possible) reuse the element:

|nextElement|
...
... (nextElement := aCollection at:index + 1) ...
...
... nextElement ...
...
Remark: we could argue about whether this kind of ``manual common subexpression elimination'' is ``bad coding style'' .... However, smalltalk was designed to offer so much freedom in what may be done in a method (and in how threads interact) that the compiler often has a hard time optimizing things AND keeping the code correct for all those cases.
Use and: ( or: ) instead of & ( | )
Many programmers tend to use "&" (ampersand) for logical and, where they could (should) use "and:". Since "&" always evaluates its right expression (even if the left already returned false), the evaluation of conditional expressions can usually be sped up by using "and:" and arranging logical expressions in a way that the minimum number of evaluations is required.
I.e., instead of:

...
condition1 & condition2 ifTrue:[
    ...
].
...

(which evaluates both conditions in any case), it is better to write:

...
(condition1 and:[condition2]) ifTrue:[
    ...
].
...

which only evaluates condition2 if the first returned true.
Of course, there are rare situations where it is intended that the right expression be evaluated for its side effect, and "&" should be used. The same is true for "|" vs. "or:".
Notice that the above may not be true for systems which do not create inline code for those blocks, but all modern smalltalks do so (ST/X does).
Arrange the conditions so that the overall condition can be computed with a minimum number of subexpression evaluations. I.e. (if the computation of the subconditions is expensive), instead of:

...
(mostlyFalse or:[mostlyTrue]) ifTrue:[
    ...
].
...

better write:

...
(mostlyTrue or:[mostlyFalse]) ifTrue:[
    ...
].
...

Of course, this is only possible if the second subcondition does not depend on side effects resulting from the first subcondition's evaluation.
Comparing integers and symbols
Small integers and symbols can be compared using #== instead of #=.
For #=, there is no speed difference if the two objects are identical, since the compiler produces inline code which checks for identity in any case. However, if the objects are not identical, a regular message send to the #= method is executed, which takes some 300-500ns (in contrast to the inline compare, which is in the order of 20-50ns).
However, for integers, you must be certain that the values are in the SmallInteger range (i.e. -2^30 to 2^30-1); otherwise, the compare will always fail, since the other number classes (especially: LargeInteger) never compare identical.
For symbols, #== is a pointer compare, while #= is effectively a string compare. On the other hand, be careful to never compare strings using #==; this will usually return false.
This (performance) argument is in opposition to the coding style argument; you must be certain about the objects' type (i.e. SmallInteger) for this "optimization" to be valid.
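For example (a sketch):

#foo == #foo     "-> true: symbols are unique; #== is safe and fast"
'foo' == 'foo'   "-> usually false: two distinct string objects"
'foo' = 'foo'    "-> true: strings must be compared with #="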
If comparing different objects to one constant object in a loop, use the constant object as the receiver; i.e. instead of:

|foo|
...
collection do:[:element |
    element = foo ifTrue:[
        ...
    ]
].
...

better write:

|foo|
...
collection do:[:element |
    foo = element ifTrue:[
        ...
    ]
].
...

which is (slightly) faster in most smalltalks (the reason lies deeply buried in the message caching mechanism).
'to:do:' vs. '(to:) do:'
Be aware that there is a slight difference between:

(1 to:10) do:[:i |
    ...
]

and:

1 to:10 do:[:i |
    ...
]

The first example creates a temporary interval object (via #to:) and then loops over this object (via #do:). The second directly sends #to:do: to an integer number. Therefore, the second example requires only one message send and does not create a temporary object (and is therefore faster by a few microseconds).
The second version may also be compiled into faster code, iff the compiler can somehow find out (or gets a hint) that the receiver and argument of the #to:do: message are integers. If this is the case, fast inline code is generated for the loop (in the above example, this is certainly true).
Comparing to nil, true and false
Comparing some object for being nil/nonNil is faster if done with #isNil, #notNil, #== or #~~ instead of #= or #~=. The same applies to comparing against false or true, although these compares are mostly implicit, via messages like #ifTrue: or #ifFalse:.
I.e. instead of:

something = nil ifTrue:[
    ...
]

or (often found in old ST-80 code, "for performance"):

nil = something ifTrue:[
    ...
]

better write:

something isNil ifTrue:[
    ...
]

or:

something == nil ifTrue:[
    ...
]

The above are equivalent, since no other object may (should) return true for a nil compare; therefore, replacing the value compare by an identity compare should always be legal. (However, since the compiler cannot know about any weird redefinitions, this optimization is not done automatically.)
Smalltalk/X specific micro tuning
The remaining sections describe various Smalltalk/X
specific performance improvements.
Avoid #become:
ST/X represents objects by a direct pointer to the underlying data structure, in contrast to some other smalltalk implementations, which use an indirect reference through an ObjectTable. This implies that ST/X's implementation of #become: has to search through all objects and exchange all (pointer-) references. Although #become: has been optimized and can be quite fast in many cases (if the objects are of the same size, or if they are located in the newSpace), the worst case (different sizes and either one in the oldSpace) involves a search through all memory and executes in the order of 100ms. In large applications and/or if paging is involved, it can take up to a second or more.
All collection classes in ST/X have been designed to avoid #become: in their grow operations - this is done by keeping the variable part of the collection in an array held in an instance variable (instead of having it in the object's indexed part).
All direct pointer smalltalk implementations have this problem in common - so the above is also true for the Squeak smalltalk system (among others).
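The technique is easy to reuse in your own classes; a minimal sketch (the contentsArray instance variable and the grow method are hypothetical):

grow
    "replace the inner container by a larger one; only one reference
     (our instance variable) needs updating, so no #become: is required"

    |newArray|

    newArray := Array new:(contentsArray size * 2 max:8).
    newArray replaceFrom:1 to:contentsArray size with:contentsArray startingAt:1.
    contentsArray := newArray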
Type hints
In ST/X, it is possible to give the compiler a hint of which type a variable is expected to be bound to at run time. This forces the compiler to create additional check code when the variable is bound to a value, and to omit checks or message sends when it is referred to. Currently, type hints are ignored for anything but method and block local variables. Also, (currently) only SmallInteger and Float type hints are effective.
For variables which are known to have a certain type, faster code is generated, since many message sends or checks can be done as fast inline code. (BTW: the compiler does some type analysis anyway, which makes a type hint unnecessary in many cases.)
The following message sends are inline coded (and therefore much faster) if the receiver's and/or argument's types are known to be integer:

#+ , #- , #* , #// and #\\
#between:and:
#asInteger , #asFloat , #rounded and #truncated
#abs and #negated
#timesRepeat: , #to:do: and #to:by:do:
#at: and #at:put: , if the compiler ``feels'' that the receiver is probably an array or string
You should use type hints only in well debugged, performance critical methods.
You will find methods which make good use of type hints in the Image class - especially where bulk pixel data is processed, type hints can speed up these operations by factors.
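The hint is written as a special comment in the local variable declaration, as found throughout the ST/X class library; a sketch:

sumTo:n
    "sum the integers 1..n; the hint tells the compiler that sum
     holds a SmallInteger, enabling inline arithmetic
     (assuming the sum does not overflow into LargeInteger range)"

    |sum "{ Class: SmallInteger }" |

    sum := 0.
    1 to:n do:[:i | sum := sum + i].
    ^ sum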
Copyright © 1995 Claus Gittinger Development & Consulting
<cg@exept.de>