Maintainable string and bytes pre-allocation in Rust
A common performance optimization in software development is to pre-allocate
strings/bytes before appending to them. In Rust, failing to do this may cause
the implementation of std::string::String or std::vec::Vec to frequently
reallocate its internal buffer to accommodate the growing size, which is slow.
Additionally, the number of bytes we end up with at the end may be far more
than we actually need.
Why we pre-allocate
Take the following code:
fn main() {
    let mut text = String::new();
    let spaces = " ".repeat(100); // a 100 byte string
    println!("Len: {}, Capacity: {}", text.len(), text.capacity());
    for _ in 0..9 {
        text.push_str(&spaces);
        println!("Len: {}, Capacity: {}", text.len(), text.capacity());
    }
}
When we run this, we can see the string being reallocated often as the length increases, and in the end we're left with an over-allocated string with a capacity of 1600 bytes when only 900 were needed:
Len: 0, Capacity: 0
Len: 100, Capacity: 100
Len: 200, Capacity: 200
Len: 300, Capacity: 400
Len: 400, Capacity: 400
Len: 500, Capacity: 800
Len: 600, Capacity: 800
Len: 700, Capacity: 800
Len: 800, Capacity: 800
Len: 900, Capacity: 1600
However, when the capacity is correctly pre-allocated:
fn main() {
    let mut text = String::with_capacity(900);
    let spaces = " ".repeat(100); // a 100 byte string
    println!("Len: {}, Capacity: {}", text.len(), text.capacity());
    for _ in 0..9 {
        text.push_str(&spaces);
        println!("Len: {}, Capacity: {}", text.len(), text.capacity());
    }
}
We only allocate text once and end up with a string whose length equals its capacity:
Len: 0, Capacity: 900
Len: 100, Capacity: 900
Len: 200, Capacity: 900
Len: 300, Capacity: 900
Len: 400, Capacity: 900
Len: 500, Capacity: 900
Len: 600, Capacity: 900
Len: 700, Capacity: 900
Len: 800, Capacity: 900
Len: 900, Capacity: 900
Problem
Seen this kind of code before?
let capacity = items
    .iter()
    .filter_map(|i| i.maybe_name.as_ref())
    .enumerate()
    .map(|(i, name)| if i > 0 { 2 } else { 0 } + name.len())
    .sum::<usize>();
let mut text = String::new();
text.try_reserve_exact(capacity)?;
for (i, name) in items
    .iter()
    .filter_map(|i| i.maybe_name.as_ref())
    .enumerate()
{
    if i > 0 {
        text.push_str(", ");
    }
    text.push_str(name);
}
debug_assert_eq!(text.len(), capacity);
The above code:
- Calculates the byte capacity of the string ahead of time.
- Allocates bytes with that capacity, returning an error when there isn't enough memory to allocate.
- Builds up the final string without causing additional allocations of text.
- Finally, does a debug assertion to ensure the final text length matches the capacity we calculated ahead of time.
Although this code is performant, it has a couple of problems:
- It's complicated.
- There's no single source of truth: the capacity calculation could get out of sync with the code that builds the string. The debug assertion helps, but only if every scenario is exercised in debug builds.
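To see the pattern end to end, here is a self-contained version of the snippet above, using a plain slice of names in place of the hypothetical items and its maybe_name field:

```rust
use std::collections::TryReserveError;

// Joins names with ", ", pre-allocating exactly the bytes needed.
fn join_names(names: &[&str]) -> Result<String, TryReserveError> {
    // Pass 1: compute the exact byte capacity (each name plus a
    // 2-byte ", " separator before every name except the first).
    let capacity = names
        .iter()
        .enumerate()
        .map(|(i, name)| if i > 0 { 2 } else { 0 } + name.len())
        .sum::<usize>();
    let mut text = String::new();
    text.try_reserve_exact(capacity)?;
    // Pass 2: build the string; no further allocations occur.
    for (i, name) in names.iter().enumerate() {
        if i > 0 {
            text.push_str(", ");
        }
        text.push_str(name);
    }
    debug_assert_eq!(text.len(), capacity);
    Ok(text)
}

fn main() {
    let joined = join_names(&["alice", "bob", "carol"]).unwrap();
    assert_eq!(joined, "alice, bob, carol");
}
```

Note that both problems are visible even in this small example: the separator logic lives in two places and must be kept in sync by hand.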
Solution
A solution to this problem is to make the code that calculates the capacity the
same as the code that builds up the string. I've rolled this up into a crate that
makes it easy:
capacity_builder
// same functionality as the above code, but simpler
use capacity_builder::StringBuilder;

let text = StringBuilder::<String>::build(|builder| {
    for (i, name) in items
        .iter()
        .filter_map(|i| i.maybe_name.as_ref())
        .enumerate()
    {
        if i > 0 {
            builder.append(", ");
        }
        builder.append(name);
    }
})?;
This runs the closure twice: once to compute the capacity and a second time to build the string. The final string will have a length equal to its capacity and will never reallocate itself while it's being built.
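To make the two-pass mechanism concrete, here's a minimal sketch of the idea in plain std Rust. This is an illustration only, not the crate's actual implementation: the same closure runs once against a byte counter and once against a pre-allocated String.

```rust
// A builder that either counts bytes (pass 1) or writes them (pass 2).
enum Builder<'a> {
    Count(&'a mut usize),
    Write(&'a mut String),
}

impl Builder<'_> {
    fn append(&mut self, text: &str) {
        match self {
            Builder::Count(len) => **len += text.len(),
            Builder::Write(out) => out.push_str(text),
        }
    }
}

fn build(f: impl Fn(&mut Builder)) -> String {
    // Pass 1: count the bytes the closure would append.
    let mut capacity = 0;
    f(&mut Builder::Count(&mut capacity));
    // Pass 2: pre-allocate once, then actually write.
    let mut text = String::with_capacity(capacity);
    f(&mut Builder::Write(&mut text));
    debug_assert_eq!(text.len(), capacity);
    text
}

fn main() {
    let names = ["alice", "bob", "carol"];
    let text = build(|b| {
        for (i, name) in names.iter().enumerate() {
            if i > 0 {
                b.append(", ");
            }
            b.append(name);
        }
    });
    assert_eq!(text, "alice, bob, carol");
    assert_eq!(text.capacity(), text.len());
}
```

Because the counting pass and the writing pass execute the same closure, the separator logic can never drift out of sync with the capacity calculation.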
Some features:
- Prevents allocations in the closure by only accepting values by reference (possible thanks to Rust's amazing borrow checker).
- Numbers can be appended (or anything that implements capacity_builder::StringAppendable).
- Can be made to work with any string type, not just std::string::String.
- Building up bytes is possible via capacity_builder::BytesBuilder.
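For reference, the same pre-allocation discipline applies to byte buffers using std alone: compute the total size ahead of time, then size the Vec once. The chunk values here are just illustrative data:

```rust
fn main() {
    let chunks: [&[u8]; 3] = [b"GET ", b"/index.html", b" HTTP/1.1\r\n"];
    // Compute the total byte length ahead of time...
    let capacity: usize = chunks.iter().map(|c| c.len()).sum();
    // ...then allocate the Vec exactly once.
    let mut bytes: Vec<u8> = Vec::with_capacity(capacity);
    for chunk in &chunks {
        bytes.extend_from_slice(chunk);
    }
    assert_eq!(bytes.len(), capacity);
    assert_eq!(bytes.capacity(), capacity);
}
```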
I've been integrating this into Deno's codebase and it's enabled us to start pre-allocating strings/vectors in complex cases that previously weren't maintainable.
If you have any suggestions or run into any issues, please open an issue on the project's GitHub: https://github.com/dsherret/capacity_builder