vector<string> is a stalwart ally, but it wilts in the face of a list of one billion twelve-letter strings. Is there any way to deal effectively with a search space of this size? And can we maintain the intuitive semantics of the STL containers while dealing with a list of one billion strings?
We need an entity that presents a container-like interface but is capable of handling arbitrarily large lists of strings made up of combinations of smaller lists [1]. And it has to do it quickly and without using inordinate amounts of memory. And the lists can be added at the beginning or the end or, in fact, anywhere in the middle. And, just in case that's too easy, let's templatize it on the type of entity contained and any traits and policy we can think of.
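A minimal sketch of what such a class might look like from the outside (the default traits parameter and the exact constructor set here are assumptions pieced together from the usage shown later in the article, not a definitive interface; the search routine appears as match in the examples and as mightmatch later on):

#include <cstddef>
#include <vector>

template<typename T, typename Traits = CatenatorTraits<T> >
class Catenator {
public:
    Catenator();                                     //empty
    explicit Catenator(std::vector<T> * items);      //wrap an existing flat list
    Catenator(const Catenator * first,
              const Catenator * second);             //append second after first
    Catenator(const Catenator& first,
              const Catenator& second,
              std::size_t inpos);                    //insert second into first at inpos
    std::size_t size() const;                        //number of combined items
    T operator[](std::size_t index) const;           //assemble one item on demand
    std::vector<T> * match(T tofind);                //all items matching tofind
    //....
};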
The first example is a programme to solve cryptic crossword clues, such as this one:

Voice contains Spring growth (8)

Contrary to surface appearance, this clue instructs us to put a synonym of 'spring' inside a synonym of 'voice' to produce a synonym of 'growth'. A computer programme might generate a list of suitable synonyms of 'voice' and 'spring' and then combine them into a longer list of all possible combinations. For instance, here are the 18 four-letter synonyms of 'voice' produced by the programme's synonym generator:
aver, call, chat, hint, jive, roar, sing, show, song, talk, tell, tone, vent, view, will, wish, word, yell

And here are the 21 four-letter synonyms of 'spring':
bolt, buck, burn, come, flow, flee, free, grow, gush, head, jump, leap, loom, lope, rill, rise, root, skip, stem, trip, well

These can be combined into three separate lists of eight-letter sequences (one each for insertion after the first, second and third letter of the synonym of 'voice'). Even this toy example produces a total of 1134 (18 × 21 × 3) candidate eight-letter strings. What we want to know is whether any item in these lists matches any of our 15 eight-letter synonyms for 'growth':
addition, amassing, boosting, dilation, dividend, earnings, hoarding, increase, maturing, offshoot, parasite, progress, spurting, swelling, uprising

If so, we will have solved the clue. The following code implements this clue-solving algorithm:
std::vector<std::string> * asyns = synonymizer.get_by_len("spring", 4);
std::vector<std::string> * bsyns = synonymizer.get_by_len("voice", 4);
std::vector<std::string> * csyns = synonymizer.get_by_len("growth", 8);

typedef Catenator<std::string> StringCat_type;
std::vector<StringCat_type *> Catenatorlist;
StringCat_type Catenatorb(bsyns);
StringCat_type Catenatora(asyns);
for (std::size_t i = 1; i < 4; ++i) {
    Catenatorlist.push_back(new StringCat_type(&Catenatorb, &Catenatora, i));
}

for (std::size_t j = 0; j < csyns->size(); ++j) {
    for (std::size_t k = 0; k < Catenatorlist.size(); ++k) {
        std::vector<std::string> * solutions = Catenatorlist[k]->match((*csyns)[j]);
        if (!solutions->empty()) {
            std::cout << "Solution to clue is: " << (*solutions)[0] << "\n";
        }
        delete solutions;
    }
}

The programme should print the following [2]:

Solution to clue is: swelling

The second example is a programme to explore gene sequencing. Amino acids are represented by three-letter sequences of the letters 'A', 'C', 'G' and 'T'. Combinations of these letter sequences make different genes, which encode different proteins. Genes can be spliced and new sequences inserted in the middle, or joined together end-to-end. A programme might start with a list of letter sequences which are combined using a certain procedure (or at random) to form a large number of long letter sequences, each of which is a gene. This might be a cheap way to simulate the production of genes that would otherwise require conventional experimentation, or to explore presumed activity 'in the wild'. The experimenter would then like to know whether particular proteins are encoded by these genes: that is, whether a particular sequence of letters appears in a particular place in any of these genes. This could be implemented using Catenators as follows:
std::size_t max_depth = 20;
std::size_t sample_size = 10;

typedef Catenator<std::string, WildCardStringTraits> CatenatorGene_type;

std::vector<std::string> amino_acids;
amino_acids.push_back("ACT");
amino_acids.push_back("GAT");
... etc ...

CatenatorGene_type * gene = new(CatenatorGene_type);
for (std::size_t i = 0; i < max_depth; ++i) {
    CatenatorGene_type * swapgene;
    std::vector<std::string> sample =
        select_random_sample(amino_acids.begin(), amino_acids.end(), sample_size);
    CatenatorGene_type catsample(sample);
    switch (i % 3) {   // alternately append, prepend and insert
    case 0:
        swapgene = new CatenatorGene_type(gene, &catsample);
        break;
    case 1:
        swapgene = new CatenatorGene_type(&catsample, gene);
        break;
    case 2:
        swapgene = new CatenatorGene_type(gene, &catsample, get_random_insert(sample_size));
        break;
    }
    swap(swapgene, gene);
    delete swapgene;
}

std::vector<std::string> * matches =
    gene->match("???????????ACAATTGGTATG???? ... etc ... ?????");
for (std::size_t j = 0; j < matches->size(); ++j) {
    std::cout << "Matching gene: " << (*matches)[j] << "\n";
}

Even this toy example features a list of 10^20 genes, each of which is 60 characters long. However, the size of this list represented as a Catenator will be approximately 600 characters and, crucially, the matching algorithm performs in line with the 600-character size, not the 60 × 10^20 list size (although as you add more wild cards, performance steadily degrades) [3].
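The wildcard behaviour above lives in the traits class rather than in the Catenator itself. As a rough sketch of the sort of comparison WildCardStringTraits might supply (the class name comes from the example; the signature is an assumption, and only the match routine is shown):

#include <cstddef>
#include <string>

struct WildCardStringTraits {
    //a '?' in the pattern matches any single character of the candidate
    static bool match(const std::string& candidate, const std::string& pattern) {
        if (candidate.size() != pattern.size()) return false;
        for (std::size_t i = 0; i < candidate.size(); ++i) {
            if (pattern[i] != '?' && pattern[i] != candidate[i]) return false;
        }
        return true;
    }
    //the remaining string traits (length, substr, join, insert, createempty)
    //would be provided unchanged
};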
Why does a problem of this shape need a new container at all? The obvious approach is to flatten the combinations as they are produced:

std::vector<std::string> * concatenate(
    const std::vector<std::string> & first,
    const std::vector<std::string> & second)
{
    std::vector<std::string> * result = new std::vector<std::string>;
    for (std::size_t i = 0; i < first.size(); ++i) {
        for (std::size_t j = 0; j < second.size(); ++j) {
            //blows up when vector gets too large
            result->push_back(first[i] + second[j]);
        }
    }
    return result;
}

Our mistake is to confuse interface and implementation. We want a nice, tidy vector-like object and so we flatten our list of strings into a nice, tidy vector. Unfortunately, this tramples over the inherent structure in the data. Instead, we should work with the structure—we will deal with interfaces later [4]. Many complex computing problems are reduced to simple computing problems by an appropriate choice of data structure—once the data structure is fixed, the code flows effortlessly [5]. The list of sub-strings is generated by a search through a problem space—usually this will involve traversal of a tree-like structure. Let's take a look at a typical string that is constructed from the sub-strings resulting from this type of traversal:
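As an illustration (the letters are purely schematic, chosen to match the segments discussed below), take the nine-letter string 'ABCDEFGHI'. It might be assembled from a two-letter segment 'AD' with the segment 'BC' inserted after its first letter, appended to a two-letter segment 'EI' with the segment 'FGH' inserted after its first letter:

    A D   ---- append ---->   E I
     |                         |
    B C                      F G H

    flattened result:  A B C D E F G H I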
The data structure underneath the flat representation of the string is a directed acyclic graph, but it is not the usual type of tree. Note that it contains both 'horizontal arrows' representing appending (or prepending) of different lists of strings and 'vertical arrows' representing insertion. Also, there may be a number of insertions in a particular list of sub-strings, up to a maximum of one at each possible insertion point. Therefore, we should expect our algorithms to contain a combination of iteration as well as the recursion more normally associated with trees.
Information about where in the flat representation of the string each individual letter is located is not stored 'locally'. In other words, a particular list of sub-strings contained at some level in the hierarchy cannot know where in the ultimate string it will end up—it only knows where to put the sub-strings that it contains. The 'AD' segment knows that it contains another two-letter segment after the first letter, and so the letter 'D' will appear three slots after the letter 'A', but it does not know where in the ultimate string its complete four-letter section will be slotted. This is a necessary trade-off: it means that sub-trees are also complete, well-formed trees themselves and we can use the same techniques to manipulate parts of trees and entire trees. In other words, the 'EI' segment above is itself a five-letter Catenator consisting of a two-letter list with another three letters inserted at position one.
The resulting data structure is a sequence of lists of sub-strings, one after the other (represented as a vector of vectors of strings ordered by index), together with a map for each sequence showing where in the sub-string other trees are inserted (represented as a map of insertion indices to pointers to sub-trees):
typedef std::vector<value_type> vect_type;
typedef std::map<std::size_t, class_type *> cqmap_type;
typedef typename std::pair<vect_type *, cqmap_type *> cqpair_type;

std::vector<cqpair_type> c_;   //private member variable

Creation of a Catenator now involves only filing the lists of strings in the appropriate place. The expense in time and space of iterating through the lists and combining them disappears [6]. Appending and prepending new lists is straightforward. Insertion of a new list of strings in an existing Catenator is slightly trickier—we must make sure that insertion at the end of a string is actually represented as appending to a string (and that insertion at the beginning is prepending) and that two insertions in the same place are combined into a single insertion. The insertion constructor takes two Catenators, one to be inserted inside the other, and the position of the insertion as arguments:
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos) : s_(0), l_(0)
To construct the new Catenator, we first advance through each constituent Catenator within the first Catenator until we reach the one into which the insertion is to be made, copying the constituents as we go along:
//copy prior to insert, and find insert point
std::size_t i = 0;
while (inpos > cqfirst.length(i)) {
    inpos -= cqfirst.length(i);
    cqfirst.copypart(this, i);
    if (++i == cqfirst.c_.size())
        throw CatenatorException("Out of range in catenator insert");
}
Once this is completed, inpos will contain the position within the i'th Catenator in which to make the insertion. This is straightforward if the insertion occurs at the beginning:
if (inpos == 0) {   // insert before next item
    for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
        cqsecond.copypart(this, k);
    }
    cqfirst.copypart(this, i);
}

Or at the end:
if (inpos == cqfirst.length(i)) {   //insert after next item
    cqfirst.copypart(this, i);
    for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
        cqsecond.copypart(this, k);
    }
}

But if the insertion occurs in the middle of this Catenator, it might occur in the middle of an insertion within that Catenator. Also, we do not know how many characters are in the top level of this Catenator, and how many are buried in insertions. Just because the insertion is at character number five, for example, it does not mean that the required insertion index is five: there might be a three-letter insertion at character one, so that an insertion at character five is accomplished with an index of two. And to cap it all, because there is only one insertion allowed at each index, we need to cater separately not only for insertions in the middle of other insertions, but also insertions at the beginning or end of other insertions. In short, all the following are possible: an insertion that falls entirely after an existing insertion, one at the beginning of an existing insertion, one at the end of an existing insertion, one in the middle of an existing insertion, or one at a point where there is no existing insertion at all.
The solution is to step through each insertion until we reach the insertion index, and then take the appropriate action:
typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
    if (inpos > mpos->first + mpos->second->length()) {
        //after end of this insertion
        inpos -= mpos->second->length();
        mpos++;
    }
    else if (inpos == mpos->first) {
        //at beginning of this insertion
        Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
        cqfirst.copypartinsert(this, i, prepcq, mpos->first);
        inpos = 0;
    }
    else if (inpos == mpos->first + mpos->second->length()) {
        //at end of this insertion
        Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
        cqfirst.copypartinsert(this, i, prepcq, mpos->first);
        inpos = 0;
    }
    else {
        //in middle of this insertion
        Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
        cqfirst.copypartinsert(this, i, newcq, mpos->first);
        inpos = 0;
    }
}
if (inpos > 0) {
    //need brand new insertion
    cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
}
Finally, the constituent Catenators following the insertion are copied into the new Catenator:
//copy items after insertion
for ( ++i ; i < cqfirst.c_.size(); ++i) {
    cqfirst.copypart(this, i);
}
Putting all these pieces together, we obtain the complete insertion constructor:
//constructor inserting cqsecond in cqfirst at position inpos
Catenator(const Catenator& cqfirst, const Catenator& cqsecond, std::size_t inpos)
    : s_(0), l_(0)
{
    //copy prior to insert, and find insert point
    std::size_t i = 0;
    while (inpos > cqfirst.length(i)) {
        inpos -= cqfirst.length(i);
        cqfirst.copypart(this, i);
        if (++i == cqfirst.c_.size())
            throw CatenatorException("Out of range in catenator insert");
    }

    //perform insertion
    if (inpos == 0) {   // insert before next item
        for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
            cqsecond.copypart(this, k);
        }
        cqfirst.copypart(this, i);
    }
    else if (inpos == cqfirst.length(i)) {   //insert after next item
        cqfirst.copypart(this, i);
        for (std::size_t k = 0; k < cqsecond.c_.size(); ++k) {
            cqsecond.copypart(this, k);
        }
    }
    else {   //insert in middle of next item
        typename cqmap_type::iterator mpos = cqfirst.c_[i].second->begin();
        while ((mpos != cqfirst.c_[i].second->end()) && (inpos >= mpos->first)) {
            //not before next insertion
            if (inpos > mpos->first + mpos->second->length()) {
                //after end of this insertion
                inpos -= mpos->second->length();
                mpos++;
            }
            else if (inpos == mpos->first) {
                //at beginning of this insertion
                Catenator * prepcq = new Catenator(&cqsecond, mpos->second);
                cqfirst.copypartinsert(this, i, prepcq, mpos->first);
                inpos = 0;
            }
            else if (inpos == mpos->first + mpos->second->length()) {
                //at end of this insertion
                Catenator * prepcq = new Catenator(mpos->second, &cqsecond);
                cqfirst.copypartinsert(this, i, prepcq, mpos->first);
                inpos = 0;
            }
            else {
                //in middle of this insertion
                Catenator * newcq = new Catenator(*(mpos->second), cqsecond, inpos - mpos->first);
                cqfirst.copypartinsert(this, i, newcq, mpos->first);
                inpos = 0;
            }
        }
        if (inpos > 0) {
            //need brand new insertion
            cqfirst.copypartinsert(this, i, cqsecond.clone(), inpos);
        }
    }

    //copy items after insertion
    for ( ++i ; i < cqfirst.c_.size(); ++i) {
        cqfirst.copypart(this, i);
    }
}

The second half of the problem is to present an intuitive interface into this daunting data structure. Our intuition is to treat the structure as a vector: a flat list of strings [7]. The interface to the Catenator connives in this pretence. The indexing operator uses a combination of iteration, recursion and modular arithmetic to assemble the indexed string on demand:
value_type operator[](std::size_t ind) const
{
    value_type t = traits_type::createempty();
    for (std::size_t i = 0; i < c_.size(); ++i) {
        std::size_t nextind = ind % c_[i].first->size();
        value_type newt((*c_[i].first)[nextind]);
        ind -= nextind;
        ind /= c_[i].first->size();
        if (c_[i].second) {
            typename cqmap_type::const_reverse_iterator it(c_[i].second->rbegin());
            while (!(it == (typename cqmap_type::const_reverse_iterator)c_[i].second->rend())) {
                std::size_t mind = ind % it->second->size();
                value_type mt = (*it->second)[mind];
                ind -= mind;
                ind /= it->second->size();
                newt = traits_type::insert(newt, mt, it->first);
                it++;
            }
        }
        t = traits_type::join(t, newt);
    }
    if (ind > 0)
        throw CatenatorException("Catenator buffer overflow.");
    return t;
}

Obviously, we incur the cost of concatenating sub-strings when we access an individual member of the Catenator—but only for the particular one that we need. The cost of iterating through every member of a Catenator using the array operator:
for (std::size_t j = 0; j < cq.size(); ++j) {
    std::cout << cq[j];
}

is the same as the cost of creating the entire flat vector—plus a small additional constant cost for each instance [8]. However, it shouldn't crash your computer: the cost in space of storing the entire vector is still averted.
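To make the modular arithmetic concrete, take one of the crossword Catenators built earlier: a single constituent of 18 'voice' synonyms carrying a single insertion of 21 'spring' synonyms, 378 strings in all (the breakdown below follows the operator[] code above and assumes that internal layout). For an index ind, ind % 18 selects the 'voice' synonym; ind is then divided by 18 and ind % 21 selects the 'spring' synonym to insert into it. If anything is left over once every constituent and insertion has been divided out, the index was out of range, which is why the routine throws the overflow exception.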
Together with the array operator, associated methods such as 'size' and 'length' are provided. Mutable private member variables are used to speed up these member functions without jeopardizing their const-ness.
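A minimal sketch of the idea behind size(), assuming that the s_ member initialised to zero in the constructors above is the cached value (the real implementation may differ in detail):

mutable std::size_t s_;   //cached size; zero means 'not yet computed'

std::size_t size() const {
    if (s_ == 0) {
        std::size_t total = 1;
        for (std::size_t i = 0; i < c_.size(); ++i) {
            total *= c_[i].first->size();          //choices at the top level
            if (c_[i].second) {
                for (typename cqmap_type::const_iterator it = c_[i].second->begin();
                     it != c_[i].second->end(); ++it) {
                    total *= it->second->size();   //choices inside each insertion
                }
            }
        }
        s_ = total;   //writing through a mutable member keeps size() const
    }
    return s_;
}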
As the examples demonstrate, we usually create a Catenator because we are looking for something in it—not because we are equally interested in every single member of the impossibly long sequence of strings it contains. The most elegant use of a Catenator is to create the entire list, search for something in it and extract all the useful parts—all without ever incurring the cost of creating or even iterating through its member strings. For this purpose, a search routine is provided to find all strings that match according to the criteria set out by the traits class. The public routine merely despatches the request to a private helper routine:
vect_type * mightmatch(value_type tofind)
{
    if (length() != traits_type::length(tofind)) return 0;
    return findpart(tofind, tofind);
}

The findpart routine divides and conquers. It loops through the individual constituent Catenators. For each one, it assembles two sets of strings from the original search string: firstly, the part that corresponds to the constituent Catenator itself, and, secondly, the parts that correspond to each insertion within that Catenator. For instance, if the search string was "abcdef" and the constituent Catenator was two letters long with a four-letter insertion at position one, the top level string to match would be "af" and the single insertion string to match would be "bcde". The heavy lifting for this division is outsourced to the topmatcher routine:
vect_type * accum = new(vect_type);
vect_type * newaccum = new(vect_type);
for (std::size_t i = 0; i < c_.size(); ++i) {
    //for each top-level constituent T
    std::map<std::size_t, value_type> inserts;
    value_type confind = topmatcher(thisfind, i, inserts);

Then we actually need to get the list of matches for these candidates. Once again, we rely on the good offices of two other routines, couldmatch and addinsertmatches, to parcel up the results into maybes:
    vect_type * maybes = couldmatch(i, confind, allfind);    //top level match only
    maybes = addinsertmatches(maybes, inserts, i, allfind);  //put insert matches into maybes

Now we just have to combine our results for this constituent Catenator with the previous constituents. This requires us to combine our list of matches so far with the current list of matches to create a new list, which the next iteration of the loop will use as the list of matches so far. There are many ways to implement this badly, using an extravagant amount of memory, cohorts of obscure pointers and pointer arithmetic, or inefficient insertions at the beginning of sequences. The findpart routine simply uses two accumulating vectors which it swaps as it goes along. The two vectors have to be allocated on the heap and explicitly deleted when no longer needed. That portion looks like this (the full code appears further down):
    if (i == 0) {
        newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
    }
    else {
        for (std::size_t n = 0; n < accum->size(); ++n) {
            for (std::size_t m = 0; m < maybes->size(); ++m) {
                newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
            }
        }
    }
    delete maybes;
    swap(accum, newaccum);
    newaccum->clear();
}
delete newaccum;
return accum;

And finally, here is the whole findpart routine:
//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind)
{
    vect_type * accum = new(vect_type);
    vect_type * newaccum = new(vect_type);
    for (std::size_t i = 0; i < c_.size(); ++i) {
        //for each top-level constituent T
        std::map<std::size_t, value_type> inserts;
        value_type confind = topmatcher(thisfind, i, inserts);
        //get top level to match and fill map of inserts to match
        vect_type * maybes = couldmatch(i, confind, allfind);
        //top level match only
        maybes = addinsertmatches(maybes, inserts, i, allfind);
        //put insert matches into maybes
        if (i == 0) {
            newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
        }
        else {
            for (std::size_t n = 0; n < accum->size(); ++n) {
                for (std::size_t m = 0; m < maybes->size(); ++m) {
                    newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
                }
            }
        }
        delete maybes;
        swap(accum, newaccum);
        newaccum->clear();
    }
    delete newaccum;
    return accum;
}

The topmatcher routine runs into the familiar problem of working out where an insertion point is actually located within a particular Catenator. As usual, the solution is to step through the Catenator, keeping track of the index and slicing up the search string as we go along: adding the part before the insert to the top-level search string and the part within the insert to the map of insertion search strings. The different slices create a string for the top-level match and a string for matching each insertion:
//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index,
                      std::map<std::size_t, value_type>& inserts)
{
    std::size_t offset = 0;
    for (std::size_t j = 0; j < index; ++j) {
        offset += length(j);
    }
    value_type confind = traits_type::createempty();
    std::size_t orig = 0;
    for (typename cqmap_type::const_iterator it = c_[index].second->begin();
         it != c_[index].second->end(); ++it) {
        //for each insert in that constituent
        std::size_t gap = it->first - orig;
        confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
        //add next bit to match string
        inserts[it->first] = traits_type::substr(thisfind, offset + gap, it->second->length());
        //get matching bit for insert
        offset += gap + it->second->length();
        //advance index into match string past insert
        orig = it->first;
    }
    confind = traits_type::join(confind,
        traits_type::substr(thisfind, offset,
                            traits_type::length((*(c_[index].first))[0]) - orig));
    //add final match part
    return confind;
}

Finally, addinsertmatches takes the top-level search string and the list of inserts and finds all the possible matches, which it then combines into a single list. It calls findpart recursively to discover the list of matching strings for each insert. To avoid painful index juggling when combining the list of results with the lists of matching insertions, it uses reverse_iterator rather than iterator, adding each list of insertions backwards starting at the end. Once again, it uses heap allocation and swap to handle accumulation of lists within a loop:
//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes,
                             std::map<std::size_t, value_type>& inserts,
                             std::size_t index, value_type allfind)
{
    for (typename std::map<std::size_t, value_type>::reverse_iterator mrit = inserts.rbegin();
         mrit != inserts.rend(); ++mrit) {
        //combine with matches for inserts
        vect_type * maybeaccum = new(vect_type);
        vect_type * nextins = (*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
        for (std::size_t b = 0; b < maybes->size(); ++b) {
            for (std::size_t c = 0; c < nextins->size(); ++c) {
                maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first));
            }
        }
        delete nextins;
        swap(maybeaccum, maybes);
        delete maybeaccum;
    }
    return maybes;
}

Finally, here is the complete picture:
//
//interface routine, dispatches to private helper
//parameter is concatenation of individual T units
//that could include wildcards depending on 'match'
//traits routine
//
public:
vect_type * mightmatch(value_type tofind)
{
    if (length() != traits_type::length(tofind)) return 0;
    return findpart(tofind, tofind);
}

//
//top level helper routine for search
//
private:
vect_type * findpart(value_type thisfind, value_type allfind)
{
    vect_type * accum = new(vect_type);
    vect_type * newaccum = new(vect_type);
    for (std::size_t i = 0; i < c_.size(); ++i) {
        //for each top-level constituent T
        std::map<std::size_t, value_type> inserts;
        value_type confind = topmatcher(thisfind, i, inserts);
        //get top level to match and fill map of inserts to match
        vect_type * maybes = couldmatch(i, confind, allfind);
        //top level match only
        maybes = addinsertmatches(maybes, inserts, i, allfind);
        //put insert matches into maybes
        if (i == 0) {
            newaccum->insert(newaccum->begin(), maybes->begin(), maybes->end());
        }
        else {
            for (std::size_t n = 0; n < accum->size(); ++n) {
                for (std::size_t m = 0; m < maybes->size(); ++m) {
                    newaccum->push_back(traits_type::join((*accum)[n], (*maybes)[m]));
                }
            }
        }
        delete maybes;
        swap(accum, newaccum);
        newaccum->clear();
    }
    delete newaccum;
    return accum;
}

//
//top level constituent finding helper routine
//
value_type topmatcher(value_type thisfind, std::size_t index,
                      std::map<std::size_t, value_type>& inserts)
{
    std::size_t offset = 0;
    for (std::size_t j = 0; j < index; ++j) {
        offset += length(j);
    }
    value_type confind = traits_type::createempty();
    std::size_t orig = 0;
    for (typename cqmap_type::const_iterator it = c_[index].second->begin();
         it != c_[index].second->end(); ++it) {
        //for each insert in that constituent
        std::size_t gap = it->first - orig;
        confind = traits_type::join(confind, traits_type::substr(thisfind, offset, gap));
        //add next bit to match string
        inserts[it->first] = traits_type::substr(thisfind, offset + gap, it->second->length());
        //get matching bit for insert
        offset += gap + it->second->length();
        //advance index into match string past insert
        orig = it->first;
    }
    confind = traits_type::join(confind,
        traits_type::substr(thisfind, offset,
                            traits_type::length((*(c_[index].first))[0]) - orig));
    //add final match part
    return confind;
}

//
//add insert matches to top level matches, helper routine
//
vect_type * addinsertmatches(vect_type * maybes,
                             std::map<std::size_t, value_type>& inserts,
                             std::size_t index, value_type allfind)
{
    for (typename std::map<std::size_t, value_type>::reverse_iterator mrit = inserts.rbegin();
         mrit != inserts.rend(); ++mrit) {
        //combine with matches for inserts
        vect_type * maybeaccum = new(vect_type);
        vect_type * nextins = (*(c_[index].second))[mrit->first]->findpart(mrit->second, allfind);
        for (std::size_t b = 0; b < maybes->size(); ++b) {
            for (std::size_t c = 0; c < nextins->size(); ++c) {
                maybeaccum->push_back(traits_type::insert((*maybes)[b], (*nextins)[c], mrit->first));
            }
        }
        delete nextins;
        swap(maybeaccum, maybes);
        delete maybeaccum;
    }
    return maybes;
}
All these gymnastics are fortunately concealed within the private implementation of the Catenator. The user has a container that looks and feels like a vector, but handles the combinations of sub-strings without complaining. Problem solved; everyone can go home.
Do you think of vectors, deques and the other STL containers as arrays on steroids? You were probably taught it that way. You may have been told that officially these containers have great freedom in how they are implemented internally, that an iterator is free to be a big, grown-up class, and certainly not just a humble pointer, and that you shouldn't rely on internal implementation details [10]. But the array paradigm is just so hard to resist.....
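A familiar illustration of the temptation (not taken from the Catenator code): handing a vector's storage straight to a C-style routine, which quietly relies on the elements being laid out contiguously:

#include <cstdio>
#include <vector>

void dump(const std::vector<char>& v, std::FILE * out) {
    //&v[0] only makes sense because vector stores its elements contiguously;
    //that is an implementation property you are choosing to lean on
    if (!v.empty()) {
        std::fwrite(&v[0], 1, v.size(), out);
    }
}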
This isn't necessarily bad. It breaks encapsulation, in that you are making an assumption about the internal implementation of a container, rather than relying on the bare information in its external interface, but it does allow you to make use of your intuitive understanding of arrays and pointers. It is an example of structural conformance -- the container presents a particular interface that we recognise from other contexts. It looks and smells like an array—and we already understand arrays. Provided that it performs in the way people think it should, nothing goes wrong. This isn't really a software design issue -- it is more a matter of philosophy as to the relative priorities of strict encapsulation and structural conformance. How useful is it to pander to people's expectations, if it might potentially lead them astray?
If all proxied containers were as redundant as vector<bool> [11], and all useful containers were as STL-compliant as vector<string>, we would demand absolute structural conformance and sweep aside the very occasional corner-case. We would not countenance the risk of anyone being deceived by a familiar interface. However, the Catenator is a perfect counter-example. It is intuitively a container with some structural conformance to a vector, but if you think of it as just like a vector, or an array on steroids, you might occasionally be led astray. It is up to the programmer to remember that he is dealing with impossibly large lists and to act accordingly. You could instead provide a completely bespoke interface:
Catenator<std::string> catenator1(vectorstring);
for (std::size_t i = 0; i < catenator1.how_many_items_there_are(); ++i) {
    std::cout << catenator1.provide_checked_data_at_index(i);
}

But the cost in terms of reuse is surely too high. Most of the time, structural conformance is too useful to be ignored. Reconciliation is not a matter of rules but of broad principles. Here are a couple you might like to try out:
So catenator.size() should tell you how many items there are in the catenator, and it should do it cheaply and not, say, by iterating through all the underlying items. Time to burn all those operator[] implementations that use a linked list.
For instance, in the Catenator class we provide an operator[]. The benefits of providing array-style access are balanced by a responsibility to prevent the programmer from misusing it. So a naive implementation might look like this:
value_type operator[](std::size_t index);

But this is nasty to any hapless programmer. The following code will compile and run flawlessly:
value_type t = catenator[1];
catenator[2] = t;

The first line is blameless. The second line will faithfully copy t into an unnamed temporary before casting it into the abyss, along with your career. One simple solution is to define operator[] as:
const value_type operator[](std::size_t index) const;

And this is the solution that is actually adopted within the Catenator. If you want finer control over the uses of the result of operator[], replace the return type with a nested class within Catenator:
class Catenator {
    ....
    class ArrayReturn {
        value_type t_;
    public:
        ArrayReturn(value_type tt) : t_(tt) { }
        value_type& operator=(value_type tt) {
            //put lvalue use here
        }
        operator value_type() {
            //put rvalue use here, eg return t_;
        }
    };

    ArrayReturn operator[](std::size_t index) {
        ...
        return ArrayReturn(result);
    }
};

But note that (unless implemented as a bolt-in class inheriting from value_type) this will not allow constructions such as:
catenator[1].const_mem_function();
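If calling members on the result matters, the usual workaround (a usage sketch, not part of the Catenator's interface; const_mem_function is the hypothetical member from the line above) is to convert the proxy back to a plain value first:

value_type v = catenator[1];   //goes through ArrayReturn's operator value_type()
v.const_mem_function();        //members are now available on the plain value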
The Catenator class is templatized on the underlying item together with a set of traits describing how to manipulate that item. For instance, the length of an item, reversing the item, cutting the last part of the item and inserting new bits in the item are all traits of the item. Another one of these manipulations allows the creation of new empty items:
template<typename T>
struct CatenatorTraits {
    ....
    static T createempty() { return T(); }
    ....
};
This trait is included to allow the Catenator to contain items that do not have a default constructor—whether a class can be default constructed or requires some other method of creation is surely a trait of that class.
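For example, a hedged sketch of how the trait might be specialised for a type that has no default constructor (Window and its two-argument constructor are made up for illustration):

struct Window {
    Window(int width, int height) : w_(width), h_(height) { }   //no default constructor
    int w_, h_;
};

template<>
struct CatenatorTraits<Window> {
    //creation is done through the trait, not through a default constructor
    static Window createempty() { return Window(0, 0); }
    //the remaining traits for Window would follow
};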
But now consider a customised container aimed at improving the memory handling characteristics of existing containers. It would still be templatized on the contained item and also on a set of traits for that item. But the allocation procedure for new items would be a key policy decision for that container:
template< typename T,
          typename Tr = MemTraits<T>,
          typename M = MemAllocator<T> >
class MemContainer {
    ....
};

So how has the same operation transformed itself from a trait carried along with the underlying item into a policy of the containing class? Clearly, the distinction depends on the purpose of the container. There is no objective difference. This is a continuation of the philosophy of C++: providing programmers with sufficient flexibility to extend the language as they see fit. To determine whether an operation is a policy or a trait, you must first know the purpose of the enclosing class: a shift in perspective is all it takes to transform one into the other.
There is another example of this in the Catenator. The search routines provide scope for the use of wildcards or variable matches, such as matching upper and lower case. Is the way you match the underlying item a trait of that item or a policy of the Catenator? For instance, considering upper and lower case matching, are the strings themselves case-insensitive or is the Catenator simply choosing to ignore case when comparing the strings? This is a rare example where it can be argued either way. The actual Catenator implementation makes it a trait rather than a policy of the Catenator. This avoids adding another template argument to the Catenator itself (or using virtual functions to vary the matching operation). It also allows all the variation in behaviour depending on the member class to be determined in a single place (the traits class) and at compile time. This decision is based, once again, on how the Catenator is to be used in practice. It is assumed that many different types might be contained within a Catenator and that how you match those types is unlikely to vary much. If Catenators contained only one or two types, but a wide variety of behaviour in how those types matched was required, then it might be more efficient—and require less typing—if the matching algorithm were implemented instead as a separate policy.
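As a sketch of the trait route actually chosen here (the traits class name and the signature of its comparison routine are assumptions), case handling sits in the traits class and is resolved at compile time:

#include <cctype>
#include <cstddef>
#include <string>

struct CaseInsensitiveStringTraits {
    //two strings match if they are equal ignoring case
    static bool match(const std::string& a, const std::string& b) {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (std::toupper(static_cast<unsigned char>(a[i])) !=
                std::toupper(static_cast<unsigned char>(b[i]))) return false;
        }
        return true;
    }
    //the remaining string traits are unchanged
};

//usage: Catenator<std::string, CaseInsensitiveStringTraits> cq(words);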
C++ is variously applauded or criticised for leaving the programmer in charge. It does not take design decisions for you. The Catenator class, apart from being a useful solution to a set of common problems, shows when you have to take these decisions yourself. Those decisions are not always determined by logic, but also by philosophy, good taste and etiquette.
operator[] to return a member of type reference. In Table 32 in section 20.1.5, reference is required to be a T&. As the elements of the Catenator returned by operator[] are constructed when needed and not stored anywhere, they don't have an address and so can't provide a T&.

deque and other STL containers are still not restricted in this way: their storage need not be, and is generally not, contiguous.

vector<bool>, together with a list of its other shortcomings.