StringChain -- efficient management of strings which are produced and consumed
in chunks

trac:

http://tahoe-lafs.org/trac/stringchain

darcs repository:

http://tahoe-lafs.org/source/stringchain/trunk/

To run tests:

python ./setup.py test

To run benchmarks:

python -OOu -c 'from stringchain.bench import bench; bench.quick_bench()'


StringChain ¶

Sometimes you want to accumulate data from some source while occasionally
consuming some of the first bytes of the data. The naive way to do it in Python
is like this:

def __init__(self):
    self.accum = '' # Will hold all unprocessed bytes

def add_data(self, some_more_data):
    # some_more_data is a string
    self.accum += some_more_data

def consume_some(self, how_much):
    # Slice off the leading bytes; strings are immutable, so we
    # rebind accum to the remainder (del on a string would raise
    # TypeError).
    some = self.accum[:how_much]
    self.accum = self.accum[how_much:]
    return some

This works fine as long as the total number of bytes accumulated and the number
of separate add_data() calls stay small, but each operation copies the entire
remaining buffer, so it has O(N**2) behavior and performs badly once those
numbers get large.

StringChain instead holds a list (actually a deque) of the strings, appends to
that list when you add a new string, and pops from the beginning of the list
when you consume some data. (It might have to pop only part of the leading
string since you might not consume a number of bytes exactly equal to the size
of the string.)
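The technique described above can be sketched in a few lines. This is a
simplified, hypothetical illustration of the deque-of-strings idea, not the
actual class -- see stringchain/stringchain.py for the real interface:

```python
from collections import deque

class SimpleStringChain:
    """Minimal sketch of the deque-of-strings technique."""

    def __init__(self):
        self.chunks = deque()  # unconsumed chunks, oldest first
        self.length = 0        # total unconsumed bytes

    def add_data(self, data):
        # Appending to a deque is O(1); earlier chunks are never copied.
        self.chunks.append(data)
        self.length += len(data)

    def consume_some(self, how_much):
        # Pop whole chunks off the front; if the requested size falls in
        # the middle of a chunk, split it and push the tail back.
        out = []
        needed = how_much
        while needed > 0 and self.chunks:
            chunk = self.chunks.popleft()
            if len(chunk) <= needed:
                out.append(chunk)
                needed -= len(chunk)
            else:
                out.append(chunk[:needed])
                self.chunks.appendleft(chunk[needed:])
                needed = 0
        consumed = ''.join(out)
        self.length -= len(consumed)
        return consumed
```

Each add_data() call is O(1), and consume_some() only touches the chunks it
actually consumes, which is where the scalability in the benchmarks below
comes from.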

Here are some benchmarks generated by running python -OOu -c 'from
stringchain.bench import bench; bench.quick_bench()':

The left-hand column is how many bytes were in the test dataset. The results
are nanoseconds per byte processed, in exponential (scientific) notation.
"Stringy" means the string-based idiom sketched above.

$ python -OOu -c 'from stringchain.bench import bench; bench.quick_bench()'
impl:  StringChain
task:  _accumulate_then_one_gulp
  10000 best: 2.408e+00
 100000 best: 2.310e+00
1000000 best: 2.073e+00
task:  _alternate_str
  10000 best: 6.509e+00
 100000 best: 4.320e+00
1000000 best: 4.011e+00
impl:  Stringy
task:  _accumulate_then_one_gulp
  10000 best: 1.788e+00
 100000 best: 2.227e+01
1000000 best: 2.241e+02
task:  _alternate_str
  10000 best: 3.695e+00
 100000 best: 1.612e+01
1000000 best: 1.127e+02

The naive approach is slower than the StringChain class, and it gets slower as
the dataset grows. The StringChain class is fast and scales well (with regard
to these benchmarks, at least...).

Okay, how do you use it? It is very simple -- see stringchain/stringchain.py,
and let me know if that interface doesn't fit your use case.

You can get the package from http://pypi.python.org/pypi/stringchain or with
darcs get http://tahoe-lafs.org/source/stringchain/trunk.

It has unit tests. It is in pure Python (it uses collections.deque and string).


LICENCE

You may use this package under the GNU General Public License, version 2 or, at
your option, any later version.  You may use this package under the Transitive
Grace Period Public Licence, version 1.0 or, at your option, any later version.
(You may choose to use this package under the terms of either licence, at your
option.)  See the file COPYING.GPL for the terms of the GNU General Public
License, version 2.  See the file COPYING.TGPPL.html for the terms of the
Transitive Grace Period Public Licence, version 1.0.
