Becoming More Effective with C++ … Day Two

Becoming More Effective with C++ … Day Two Stanley B. Lippman stan.lippman@gmail.com

Generic Programming

Generic Programming: a definition • There are two general models of supporting multiple types for an invariant code base – that is, for what is referred to as generic programming. • The Universal Type Container Model: strip away the type information associated with the object. This is a simple mechanism to support because it is reductive. All objects are stored in a uniform and opaque manner. In a language like C, this universal container is void*, which is type-less. In a language like C#, it is Object.
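
As a rough illustration of the universal type container model in standard C++, here is a minimal sketch of a container that stores void*. The names VoidStack, kCapacity, and popInt are hypothetical and exist only for this example; the point is that the container itself knows nothing about what it holds, so the caller must cast back and is trusted to get the type right.

```cpp
#include <cstddef>
#include <string>

// Hypothetical fixed-capacity "universal" container: every element is stored
// as void*, so all type information is stripped away on insertion.
class VoidStack {
public:
    VoidStack() : top_(0) {}

    // Returns false when the container is full.
    bool push(void *p) {
        if (top_ == kCapacity) return false;
        slots_[top_++] = p;
        return true;
    }

    // Returns a null pointer when the container is empty.
    void *pop() { return top_ ? slots_[--top_] : nullptr; }

    std::size_t size() const { return top_; }

private:
    static constexpr std::size_t kCapacity = 16;
    void *slots_[kCapacity];
    std::size_t top_;
};

// The caller has to remember the real type and cast back explicitly.
// Nothing checks this cast: popping a string here would be undefined behavior.
inline int popInt(VoidStack &s) {
    return *static_cast<int *>(s.pop());
}
```

The same container can hold an int and a std::string side by side, which is exactly why each retrieval needs an unchecked downcast.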

A C++ .NET Example
    interface class IEnumerator {
        property ??? Current { ??? get(); }
        bool MoveNext();
        void Reset();
    };
• The difficulty of this declaration is the need to specify the type associated with the Current property. • This is the only aspect of the interface signature that is likely to vary across implementations.

The Universal Type Container Model
    interface class IEnumerator {
        property Object^ Current { Object^ get(); }
        // …
    };
• Object serves as a universal placeholder because every type is a kind of Object; this gives us one degree of separation. • The problem with this solution is that one must explicitly downcast to manipulate the object, and this is either unsafe or inefficient.

In Some Contexts, It Works Well …
    void iterate( Object^ o ) {
        if ( IEnumerable^ e = dynamic_cast<IEnumerable^>( o )) {
            IEnumerator^ ie = e->GetEnumerator();
            while ( ie->MoveNext() ) {
                Object^ elem = ie->Current;
                // …
            }
        }
    }
• This handles any and every CLI type – and is an impressive and inexpensive generic mechanism … • That is, it works very well when we don’t need to manipulate the actual type …

In Other Contexts, Not So Well …
    void iterate( Fibonacci^ fib ) {
        Object^ o = fib->GetEnumerator();
        NumericSeqEnumerator^ nse = static_cast<NumericSeqEnumerator^>( o );
        while ( nse->MoveNext() ) {
            int elem = static_cast<int>( nse->Current );
            // …
        }
    }
• The Object solution handles every type well, but does not handle any particular type quite as well … • It may prove significantly expensive to iterate across our elements if each access requires unboxing …

Generic Programming: a definition • There are two general models of supporting multiple types for an invariant code base – that is, for what is referred to as generic programming. • The Type Parameter Model: factor out and delay the binding of type information associated with the object. This follows the model of a function in which values that may vary from one invocation to the next are factored out into parameters. For this reason, this model is generally referred to as parameterized types. It is a more complex mechanism to support.

The Type Parameter Model • Object provided us with one degree of separation. • Type parameterization provides us with an additional degree of separation. • It allows us to identify a type-neutral placeholder that will later be bound to a user-specified type.
    interface class IEnumerator {
        property WildCard Current { WildCard get(); }
        // …
    };

Syntactic Scaffolding … • The next problem is how to indicate to the compiler (the machine reader of our program) that WildCard is a placeholder and not a real program entity. • In C++, this is done through a parameter list introduced by a reserved keyword.
    template <typename WildCard>
    interface class IEnumerator {
        property WildCard Current { WildCard get(); }
        // …
    };

Parameterized Type Support in C++ • The C++ language supports generic programming through its template mechanism. • The general definition begins with a keyword introducing the mechanism to be used: template. • This is followed by a parameter list enclosed in an angle-bracket pair (< >). • Following this is the actual definition, which can be a class or function. • Within this definition, the parameters serve as placeholders for actual types later supplied by the user.

The Canonical Template Stack
    template <class elemType>
    class Stack {
        vector<elemType> m_stack;
        int top;
    public:
        Stack();
        elemType& pop();
        void push( const elemType &et );
        // …
    };

Using Our Template Stack
    Stack<int> is( 10 );
    for ( int ix = 0; ix < 10; ix++ )
        is.push( ix*2 );
    int elem_cnt = is.size();
    for ( int ix = 0; ix < elem_cnt; ++ix )
        cout << is.pop();
    Stack<string> ss( 10 );
    ss.push( "Pooh" ); ss.push( "Piglet" );
    ss.push( "Rabbit" ); ss.push( "Eeyore" );
    elem_cnt = ss.size();
    for ( int ix = 0; ix < elem_cnt; ++ix )
        cout << ss.pop();
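
The slides elide most of the Stack members, so the following is only a sketch of one way the class could be completed so that the usage above compiles. The capacity-hint constructor, size(), empty(), and the by-value pop() are assumptions made for this example, not the author's definition.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A sketch of a template Stack built on std::vector.
// Everything beyond push() and pop() is an assumption for illustration.
template <class elemType>
class Stack {
public:
    // The argument is treated as a capacity hint only (assumed meaning).
    explicit Stack(std::size_t capacityHint = 0) { m_stack.reserve(capacityHint); }

    void push(const elemType &et) { m_stack.push_back(et); }

    // Returns by value: a reference to the removed element would dangle.
    elemType pop() {
        assert(!m_stack.empty());
        elemType top = m_stack.back();
        m_stack.pop_back();
        return top;
    }

    std::size_t size() const { return m_stack.size(); }
    bool empty() const { return m_stack.empty(); }

private:
    std::vector<elemType> m_stack;
};
```

Stack<int> and Stack<std::string> are two distinct classes generated from this one definition; neither exists until the program names it.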

The Parameter List • The parameter list serves in a manner similar to the signature of a function: • identifying the number and kind of each parameter, and • associating a unique identifier with each so that each can be referred to within the template definition. • The user creates object instances of the type by providing actual values to bind to the parameters. • The instantiation of a parameterized type consists of binding the actual user values to the associated formal parameters within the definition. • It is not a simple text substitution such as a macro expansion mechanism might employ.

The Template Parameter List • Templates support two additional categories of parameters • expression parameters and • template parameters. • In addition, templates support default values.
    typedef void (*handler)( const vector<string>& );
    void defaultHandler( const vector<string>& ){ … }
    template < class elemType, int size = 1024, handler cback = &defaultHandler >
    class Stack {};
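
To make the role of expression parameters and default values concrete, here is a small sketch. FixedStack, capacity(), count(), and notify() are hypothetical names introduced for this example, and the empty body of defaultHandler is an assumption, since the slide elides it.

```cpp
#include <string>
#include <vector>

typedef void (*handler)(const std::vector<std::string> &);

// Assumed do-nothing default callback; the slide does not show its body.
inline void defaultHandler(const std::vector<std::string> &) {}

// size is an expression (non-type) parameter: a compile-time constant that
// becomes part of the type. cback is a pointer-to-function parameter.
template <class elemType, int size = 1024, handler cback = &defaultHandler>
class FixedStack {
public:
    FixedStack() : top_(0) {}

    static int capacity() { return size; }

    // Returns false once the compile-time capacity is reached.
    bool push(const elemType &e) {
        if (top_ == size) return false;
        data_[top_++] = e;
        return true;
    }

    int count() const { return top_; }

    // Forwards to whichever callback the type was instantiated with.
    static void notify(const std::vector<std::string> &msgs) { cback(msgs); }

private:
    elemType data_[size];
    int top_;
};
```

FixedStack<int> and FixedStack<int, 2> are unrelated types: the value 2 is bound at compile time, just as int is.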

Type Constraints • In general, when we say that a parameterized type can support an infinite variety of types, we are speaking of passive uses of parameterization – that is, the storage and retrieval of a type rather than actively manipulating it. • Once we begin to manipulate an instance of a type parameter, such as • comparing whether one object of that type is equal to or less than another, or • when we invoke the name of a method or nested type through a type parameter, we discover that there are implicit constraints to the uses we can make of it.

Template Type Constraints are Implicit • One criticism of the C++ template mechanism is the absence of a formal syntax to delineate these sorts of type constraints. • That is, the user can recognize the implicit constraints of a template only by reading the source code or the associated documentation, or by compiling her code and reading the ensuing compiler error messages. • There are a number of template facilities that allow the user some wiggle room. • For example, a member function of a class template is not instantiated for a type argument until that function is used, so a type that cannot support one member can still be used as long as that member is never called. • If avoiding the member is not feasible, a second alternative is to provide a specialized version of that member function that is specifically associated with our type argument.
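
Both kinds of wiggle room can be seen in a few lines. This is a sketch under assumed names: Holder, Opaque, and Tagged are hypothetical types invented for the example. Holder<Opaque> is usable even though Opaque has no operator<, provided lessThan() is never called on it; Holder<Tagged> gets its own specialized lessThan().

```cpp
struct Opaque { int id; };    // hypothetical type with no operator<
struct Tagged { int rank; };  // hypothetical type with no operator< either

template <typename T>
class Holder {
public:
    explicit Holder(const T &v) : value_(v) {}

    // Passive use: only requires that T can be copied.
    const T &get() const { return value_; }

    // Active use: implicitly requires operator< on T, but the requirement
    // is only checked if this member is actually used for a given T.
    bool lessThan(const Holder &rhs) const { return value_ < rhs.value_; }

private:
    T value_;
};

// Explicit specialization of one member function for one type argument.
template <>
inline bool Holder<Tagged>::lessThan(const Holder<Tagged> &rhs) const {
    return value_.rank < rhs.value_.rank;
}
```

Calling lessThan() on a Holder<Opaque> would fail to compile, and the error would point into the template body rather than at a declared constraint, which is the criticism the slide raises.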

Containers Are the Canonical Generic Type … • Containers, such as an array, associative array (map, dictionary), set, or list, are a natural abstraction for a generic solution. • In C++, the Standard Template Library (STL) provides a template container library that integrates support for the built-in array and is extensible to user-defined types. • This is what we will look at in the rest of the unit.

Three Primary STL Elements: Containers • There are three primary elements to the design of the Standard Template Library (STL) … • Containers • Each holds homogeneous elements in a particular 'view'. • Predefined: vector<T>, list<T>, map<K,V>, set<K> … • Extensible: by following certain conventions, a user can define her own containers …
    vector<int> ivec;
    list<std::string> slist;
    map<std::string, int> wordCount;
    set<MyType> smt;

Three Primary STL Elements: Algorithms • There are three primary elements to the design of the Standard Template Library (STL) … • Algorithms • Each algorithm provides type-independent operations, where by type I mean one of the predefined or user-defined containers … • sort(), find(), merge(), accumulate(), random_shuffle(), for_each() • Extensible: by following certain conventions, a user can define her own algorithms …

Three Primary STL Elements: Iterators • There are three primary elements to the design of the Standard Template Library (STL) … • Iterators • These provide a type-independent way to specify a container location, since a vector and a list, for example, have no direct common way to indicate 'start at element 3' … • Each container type provides a begin() and end() method to return an iterator representing the beginning and one past the end of the container. • Iterators are a class abstraction of a 'pointer': each overloads ++, --, *, ==, != … This is what allows an array, a list, and a vector to be passed to, for example, find().
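
As a sketch of what the pointer-like interface buys us, the function below is written using only ==, !=, ++, and * on its arguments, so the same source works for a built-in array, a vector, and a list. The names ascending and sameCodeThreeContainers are hypothetical, chosen for this example.

```cpp
#include <list>
#include <vector>

// Returns true if the elements in [first, last) are in ascending order.
// Only ==, !=, ++, and * are applied to the iterators, so plain pointers,
// vector iterators, and list iterators are all acceptable.
template <typename Iter>
bool ascending(Iter first, Iter last) {
    if (first == last) return true;   // an empty range is trivially ordered
    Iter next = first;
    for (++next; next != last; ++first, ++next)
        if (*next < *first) return false;
    return true;
}

// One element sequence, three representations, one algorithm.
inline bool sameCodeThreeContainers() {
    int ia[4] = {5, 8, 13, 21};
    std::vector<int> ivec(ia, ia + 4);
    std::list<int> ilist(ia, ia + 4);
    return ascending(ia, ia + 4) &&
           ascending(ivec.begin(), ivec.end()) &&
           ascending(ilist.begin(), ilist.end());
}
```

Nothing in ascending() mentions a container type; the begin/end pair is the whole interface.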

Header files and namespace …
    #include <vector>
    #include <list>
    #include <map>
    #include <set>
    #include <algorithm>
    using namespace std;

Iterators and Pointers …
    void f() {
        int ia[4] = { 21, 8, 5, 13 };
        int *pbegin = ia;
        int *pend = &ia[4];
        vector<int> ivec( pbegin, pend );      // 1
        list<int> ilist( pbegin, pend );       // 1
        sort( pbegin, pend );                  // 1
        sort( ivec.begin(), ivec.end() );      // 2
        ilist.sort();                          // 3
    }
    // 1 integer pointers serving as 'iterators'
    // 2 begin() and end() return a vector::iterator to the first element and to one past the last …
    // 3 list::iterator is bidirectional rather than random access, so the generic sort() does not accept it; list provides a member sort() instead …

Constraints on the ideal design: Formal Constraints … • Now, this is an idealized view of the Standard Template Library that does not prove feasible in practice for two primary reasons … • Formal Constraints • A container type represents a view of the homogeneous elements – not all container types support all algorithms. • Neither a map nor a set, for example, supports random_shuffle() or any reordering of the elements …

Constraints on the ideal design: Practical Constraints … • Now, this is an idealized view of the Standard Template Library that does not prove feasible in practice for two primary reasons … • Practical Constraints • The generic sort() requires random-access iterators, so it cannot be applied to a list at all, and generic algorithms that rearrange elements by copying them, such as remove(), unique(), or reverse(), do more work on a list than relinking its nodes would ... • One would wish for both containers to perform with parity. So as not to penalize users of a list, the list container provides its own optimized member operations, such as sort(), merge(), remove(), unique(), and reverse().
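
A minimal sketch of the practical split for sorting: a vector goes through the generic algorithm, while a list goes through its own member function. sortValues is a hypothetical overloaded helper introduced only to put the two calls side by side.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// A vector offers random-access iterators, so the generic algorithm applies.
inline void sortValues(std::vector<int> &v) {
    std::sort(v.begin(), v.end());
}

// A list offers only bidirectional iterators. std::sort(l.begin(), l.end())
// would not compile; the member sort() relinks nodes instead of moving values.
inline void sortValues(std::list<int> &l) {
    l.sort();
}
```

The caller sees one name, but the container type decides which mechanism runs.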

Algorithms are really block algorithms … • So, in a sense, the generic algorithms are block algorithms • block in the sense of contiguous storage … • generic in the sense that they manipulate a block of any element type … • That is, the generic algorithms are less appropriate with a list or the associative container types such as a map or set … • In fact, when I worked with Alex Stepanov at Bell Laboratories, he referred to them as block algorithms …

Let’s Try to Do Something • We’re to read an arbitrary number of text files of arbitrary size. • Each text file is to be stored as string objects in some sort of container. • We do not know how many files there are to read, nor how large each file is. How might we do that?

Reading and Storing Text Files …
    // requires <fstream>, <iterator>, <string>, <vector>, <algorithm>
    typedef vector<string> textwords;
    vector< string > textFiles;    // file paths
    vector< textwords > texts;     // vector of vector<string>
    vector<string>::iterator it = textFiles.begin();
    vector<string>::iterator end_it = textFiles.end();
    while ( it != end_it ) {
        vector<string> text;
        ifstream infile( it->c_str() );
        if ( infile ) {
            // read whitespace-separated words until end of file
            istream_iterator<string> iter( infile ), end_iter;
            copy( iter, end_iter, back_inserter( text ));
            texts.push_back( text );
        }
        ++it;
    }
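
The heart of the loop is the istream_iterator and back_inserter pairing, which can be tried without any files on disk. The sketch below factors it into a hypothetical helper, readWords, that accepts any std::istream, so a std::istringstream can stand in for an ifstream.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Reads every whitespace-separated word from the stream into a vector.
// A default-constructed istream_iterator serves as the end-of-stream marker,
// and back_inserter grows the vector as elements are copied in.
inline std::vector<std::string> readWords(std::istream &in) {
    std::vector<std::string> text;
    std::istream_iterator<std::string> iter(in), end_iter;
    std::copy(iter, end_iter, std::back_inserter(text));
    return text;
}
```

Because the vector grows on demand, neither the number of words nor the size of the input has to be known in advance.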

Encapsulating It …
    typedef vector<string> textwords;
    class TextManager {
    public:
        TextManager( string *first, string *last )
            : textFiles( first, last ) {}
        void readFiles();
        void displayFiles();
    private:
        vector< string > textFiles;
        vector< textwords > texts;
    };
    string fileTable[fileCount] = { "…", "…" };
    TextManager tm( fileTable, fileTable+fileCount );
    tm.readFiles();
    tm.displayFiles();

Using It …
    int main() {
        const int fileCount = 2;
        string fileTable[fileCount] = { "ESCJava.txt", "Lookup.txt" };
        TextManager tm( fileTable, fileTable+fileCount );
        tm.readFiles();
        tm.displayFiles();
        tm.processText();
        tm.displayText();
    }

Continuing the Something … • Merge a copy of each container • Remove punctuation • Sort the container alphabetically • Remove all duplicate words

    vector<textwords>::iterator iter = texts.begin();
    // merge a copy of each text file
    for ( ; iter != texts.end(); ++iter )
        copy( iter->begin(), iter->end(), back_inserter( theText ));
    // remove the punctuation
    filter_text();
    // sort the elements of theText
    sort( theText.begin(), theText.end() );
    // delete all duplicate elements: two steps ...
    vector<string>::iterator it = unique( theText.begin(), theText.end() );
    theText.erase( it, theText.end() );
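
The merge, sort, and two-step duplicate removal can be exercised in isolation. This is a sketch that gathers those steps into a hypothetical free function, mergeSortUnique; it leaves out the punctuation filtering and takes the texts as a parameter instead of using class members.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

typedef std::vector<std::string> textwords;

// Merges every text into one vector, sorts it, and drops duplicate words.
inline textwords mergeSortUnique(const std::vector<textwords> &texts) {
    textwords theText;
    for (std::vector<textwords>::const_iterator it = texts.begin();
         it != texts.end(); ++it)
        std::copy(it->begin(), it->end(), std::back_inserter(theText));

    // unique() only collapses adjacent duplicates, so sort first.
    std::sort(theText.begin(), theText.end());

    // unique() moves the survivors to the front and returns the new logical
    // end; the container's size changes only when erase() is called.
    theText.erase(std::unique(theText.begin(), theText.end()), theText.end());
    return theText;
}
```

The two-step form exists because a generic algorithm sees only iterators and therefore cannot resize the container it is working on.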

    void filter_text() {
        string filt_elems( "\",.;:!?)(\\/" );
        vector<string>::iterator iter = theText.begin();
        vector<string>::iterator iter_end = theText.end();
        while ( iter != iter_end ) {
            string::size_type pos = 0;
            // after an erase, the next character slides into pos, so search from pos again
            while (( pos = iter->find_first_of( filt_elems, pos )) != string::npos )
                iter->erase( pos, 1 );
            ++iter;
        }
    }

Re-sort the container by length, but keep the words alphabetized within each length.
    bool less_than( const string &s1, const string &s2 )
        { return s1.size() < s2.size(); }
    // the generic algorithm
    stable_sort( theText.begin(), theText.end(), less_than );
• What is problematic with this?

Performance plus Flexibility • While this gets the job done, it is considerably less efficient than we might wish. • less_than() is implemented as a single statement. Normally, it would be expanded inline at the point of call. • By passing it in as a pointer to function, however, we typically prevent it from being inlined. • An alternative strategy that preserves the inline-ability of the operation is a function object.

Function Objects
    class LessThan {
    public:
        bool operator()( const string &s1, const string &s2 ) const
            { return s1.size() < s2.size(); }
    };
    stable_sort( theText.begin(), theText.end(), LessThan() );

Count the number of words longer than some specified value.
    int cnt = count_if( theText.begin(), theText.end(), GreaterThan() );

Another Function Object …
    class GreaterThan {
    public:
        GreaterThan( int sz = 6 ) : _size( sz ) {}
        int size() const { return _size; }
        bool operator()( const string &s1 ) const
            { return s1.size() > _size; }
    private:
        int _size;
    };
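
Here is a sketch that puts the two function objects to work together. The classes follow the slides, with const call operators and a size_type threshold as small adjustments; sortByLength is a hypothetical helper added for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Binary function object: orders strings by length only.
class LessThan {
public:
    bool operator()(const std::string &s1, const std::string &s2) const {
        return s1.size() < s2.size();
    }
};

// Unary function object carrying state: the length threshold is fixed
// when the object is constructed, not when it is called.
class GreaterThan {
public:
    explicit GreaterThan(std::string::size_type sz = 6) : _size(sz) {}
    bool operator()(const std::string &s) const { return s.size() > _size; }
private:
    std::string::size_type _size;
};

// Sorts alphabetically, then by length; stable_sort keeps equal-length
// words in their existing alphabetical order.
inline void sortByLength(std::vector<std::string> &words) {
    std::sort(words.begin(), words.end());
    std::stable_sort(words.begin(), words.end(), LessThan());
}
```

Because the call operator is an ordinary inline member of a concrete type, the compiler knows exactly which code each comparison runs, which is what makes inlining straightforward. A call such as count_if( words.begin(), words.end(), GreaterThan( 4 )) shows the state at work: the threshold travels inside the object.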

Print out the resulting text.
    vector<string>::iterator it = theText.begin();
    vector<string>::iterator end_it = theText.end();
    while ( it != end_it )
        cout << *it++;

Alternative Print Solution … • Print out the resulting text.
    class PrintElem {
    public:
        PrintElem( int len = 8 ) { … }
        void operator()( const string &elem );
    private:
        int _line_length;
        int _cnt;
    };
    for_each( theText.begin(), theText.end(), PrintElem() );

Designing a Generic Algorithm

Designing Generic Algorithms • This is the generic algorithm we’ll finish with …
    template <typename InIter, typename OutIter, typename ElemType, typename Comp>
    OutIter filter( InIter first, InIter last, OutIter at, const ElemType &val, Comp pred ) {
        // bind2nd() requires an adaptable binary function object such as less<int>
        while (( first = find_if( first, last, bind2nd( pred, val ))) != last )
            *at++ = *first++;   // copy the match, then step past it
        return at;
    }

Invoking the Algorithm • Hopefully it is not yet obvious what it does or how one would write it. In any case, here is how we might use it:
    int main() {
        // .. ia & ia2 are built-in arrays,
        // .. ivec & ivec2 are STL vectors
        cout << "filtering for values less than 8\n";
        filter( ia, ia+elem_size, ia2, 8, less<int>() );
        cout << "filtering for values greater than 8\n";
        filter( ivec.begin(), ivec.end(), ivec2.begin(), 8, greater<int>() );
    }
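
For readers on a current compiler, here is a sketch of the same algorithm with a lambda standing in for bind2nd(), which was deprecated in C++11 and removed in C++17. The substitution is mine, not the author's; the signature and the intended behavior are taken from the slides.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

// Copies to `at` every element of [first, last) for which pred(elem, val)
// holds, and returns the position one past the last element written.
template <typename InIter, typename OutIter, typename ElemType, typename Comp>
OutIter filter(InIter first, InIter last, OutIter at,
               const ElemType &val, Comp pred) {
    // The lambda fixes val as the second argument of the binary predicate.
    auto matches = [&](const ElemType &elem) { return pred(elem, val); };
    while ((first = std::find_if(first, last, matches)) != last)
        *at++ = *first++;   // copy the match, then resume just past it
    return at;
}
```

With the values { 12, 8, 43, 0, 6, 21, 3, 7 }, filtering with std::less<int>() and 8 copies 0, 6, 3, 7, and filtering with std::greater<int>() and 8 copies 12, 43, 21. Writing through std::back_inserter avoids having to size the destination vector beforehand.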

A First Iteration • The primary focus of this unit is to arrive at this design through a series of iterations. • Here is our first iteration. What does it do?
    // the vector is const, so the address we return must be a pointer to const
    const int* find( const vector<int> &vec, int value ) {
        for ( int ix = 0; ix < vec.size(); ++ix )
            if ( vec[ ix ] == value )
                return &vec[ ix ];
        return 0;
    }

A Second Iteration • Ok. We need to have our function work not just with integers, but with any type for which the equality operator is defined. • That’s simple. We turn it into a function template:
    template <typename elemType>
    const elemType* find( const vector<elemType> &vec, const elemType &value ) {
        for ( int ix = 0; ix < vec.size(); ++ix )
            if ( vec[ ix ] == value )
                return &vec[ ix ];
        return 0;
    }

A Third Iteration • Ok. Things get a little trickier now. • We next need to modify our function such that a single instance can accept either a vector or a built-in array. • Let’s first solve the problem for the built-in array. • How can we pass in the elements of an array without specifying the array itself?

A Third Iteration • Pass in a pointer to the first element of the array. • Pass in a second pointer addressing one past the last element. • Advance the first pointer until it equals the second pointer.
    template <typename elemType>
    const elemType* find( const elemType *first, const elemType *last, const elemType &value ) {
        if ( ! first || ! last )
            return 0;
        for ( ; first != last; ++first )
            if ( *first == value )
                return first;
        return 0;
    }

A Third Iteration • How might our function be invoked?
    int ia[ 8 ] = { 1, 1, 2, 3, 5, 8, 13, 21 };
    const int *pi = find( ia, ia+8, ia[3] );
    double da[6] = { 1.5, 2.0, 2.5, 3.0, 3.5, 4.0 };
    const double *pd = find( da, da+6, da[3] );
    string sa[4] = { "pooh", "piglet", "eeyore", "tigger" };
    const string *ps = find( sa, sa+4, sa[3] );
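
A compilable sketch of the pointer-pair version follows. It is renamed find_elem, a hypothetical name, so that it cannot collide with std::find when <algorithm> and a using directive are in effect, and it returns a pointer to const because it only ever sees const elements.

```cpp
#include <string>

// Searches [first, last) for value.
// Returns the address of the first match, or a null pointer if none is found.
template <typename elemType>
const elemType *find_elem(const elemType *first, const elemType *last,
                          const elemType &value) {
    if (!first || !last) return nullptr;
    for (; first != last; ++first)
        if (*first == value) return first;
    return nullptr;
}
```

Note that the search returns the first element equal to the value: find_elem( ia, ia+8, ia[1] ) on { 1, 1, 2, 3, 5, 8, 13, 21 } yields &ia[0], not &ia[1].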

A Fourth Iteration • Ok, so what do we need to do to have it support a vector as well? • A vector holds its elements in a contiguous area of memory, and so we can pass in the begin/end pair of addresses much as we do for the built-in array. • Except in one case. (There’s always one case, isn’t there?) • Do you know what that is?

A Fourth Iteration • Unlike an array, a vector can be empty. For example, the following defines an empty vector of string elements:
    vector<string> svec;
• The following invocation fails miserably if svec is empty:
    find( &svec[0], &svec[svec.size()], search_value );
• A safer implementation is to first test that svec is not empty.
    if ( ! svec.empty() ) // ... Ok
• While this is safer, it is somewhat unpleasant for the user!
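
One way to spare the user that test is to perform it once, inside a wrapper. The following is only a sketch under that assumption; find_in is a hypothetical name, and this is not necessarily the direction the unit takes next.

```cpp
#include <string>
#include <vector>

// Searches a vector through a begin/end pair of element addresses, but only
// after confirming that there is a first element whose address can be taken.
template <typename elemType>
const elemType *find_in(const std::vector<elemType> &vec,
                        const elemType &value) {
    if (vec.empty()) return nullptr;             // never evaluate vec[0] when empty
    const elemType *first = &vec[0];
    const elemType *last = first + vec.size();   // one past the last element
    for (; first != last; ++first)
        if (*first == value) return first;
    return nullptr;
}
```

The emptiness check now lives in one place, so a call on an empty vector simply reports that nothing was found.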
