Strings - Rust for C-Programmers

18.3 The `String` Type

Rust’s String type represents a growable, mutable, owned sequence of UTF-8 encoded text. Like Vec<T>, the String struct itself is a small, fixed-size value (on the stack when held in a local variable) containing a pointer to its heap-allocated text data, a length, and a capacity. Only the text data lives on the heap; the struct manages that allocation. A call to String::new() does not allocate memory on the heap; it only creates an empty String struct. The heap allocation happens when data is first added to the string. This automatic and lazy memory management is a key advantage over manual C-style string handling.
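The lazy allocation described above can be observed through len() and capacity(). The following is a minimal sketch; the helper name `len_and_capacity` is hypothetical, and the exact capacity chosen after the first push is an implementation detail, so the code does not rely on a specific value.

```rust
use std::mem::size_of;

/// Returns (length in bytes, capacity in bytes) of a String.
/// `len_and_capacity` is a hypothetical helper used only for this illustration.
fn len_and_capacity(s: &String) -> (usize, usize) {
    (s.len(), s.capacity())
}

fn main() {
    // Size of the String handle itself (pointer, length, capacity),
    // independent of how much text it holds.
    println!("size_of::<String>() = {} bytes", size_of::<String>());

    let mut s = String::new();
    // No heap buffer yet: length 0 and capacity 0.
    println!("after new:  {:?}", len_and_capacity(&s));

    s.push('a');
    // The first push triggers the allocation; capacity is now at least 1.
    println!("after push: {:?}", len_and_capacity(&s));
}
```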

18.3.1 Understanding `String` vs. `&str`

This distinction is fundamental in Rust and often a point of confusion for newcomers:

String: An owned, heap-allocated buffer containing UTF-8 text. It owns the data it holds. It is mutable (can be modified, e.g., by appending text) and responsible for freeing its memory when it goes out of scope. Think of it like a Vec<u8> specialized for UTF-8.
&str (string slice): A borrowed, immutable view into a sequence of UTF-8 bytes. It consists of a pointer to the data and a length. It does not own the data it points to. It can refer to part of a String, an entire String, or a string literal embedded in the program’s binary.
- String literals: Expressions like "hello" in your code have the type &'static str. The 'static lifetime means the reference is valid for the entire duration of the program, because the underlying string data ("hello") is embedded directly into the program’s binary data segment and thus lives forever.
- The str type: You might wonder about str without the &. str itself is the primitive sequence type, but it’s an unsized type (Dynamically Sized Type or DST) because its length isn’t known at compile time. Because variables and function arguments must have a known size, Rust requires that we always interact with str via pointers like &str (a “fat pointer” containing address and length) or Box<str> (an owned pointer). &str is the ubiquitous borrowed form.

You can get an immutable &str slice from a String easily (e.g., &my_string[..], or often implicitly via deref coercion), but converting a &str to an owned String usually involves allocating memory and copying the data (e.g., using .to_string() or String::from()).
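A short sketch of these conversions, assuming a hypothetical helper `word_count` that accepts &str so that callers can pass literals, slices, or borrowed Strings alike:

```rust
/// Counts whitespace-separated words.
/// `word_count` is a hypothetical helper; taking &str keeps it usable
/// with String, string slices, and literals.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    let owned: String = String::from("hello brave new world");
    let literal: &'static str = "hello world";

    println!("{}", word_count(&owned));      // &String deref-coerces to &str
    println!("{}", word_count(&owned[6..])); // slice of part of the String: "brave new world"
    println!("{}", word_count(literal));     // string literal

    // Going the other way (&str -> String) allocates and copies the bytes.
    let copy: String = literal.to_string();
    println!("{}", copy);
}
```

Accepting &str rather than &String in function signatures is usually the more flexible choice, much like accepting const char* in C regardless of where the buffer lives.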

18.3.2 `String` vs. `Vec<u8>`

While a String is internally backed by a buffer of bytes (like Vec<u8>), its primary difference is the UTF-8 guarantee. String methods ensure that the byte sequence remains valid UTF-8. If you need to handle arbitrary binary data, raw byte streams, or text in an encoding other than UTF-8, you should use Vec<u8> instead. Creating a String from bytes is a checked operation: String::from_utf8() returns an Err for invalid UTF-8 (which panics only if you call unwrap() on it), while String::from_utf8_lossy() replaces invalid sequences with U+FFFD.
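The UTF-8 check can be seen when converting a Vec<u8> into a String. This is a minimal sketch; `bytes_to_text` is a hypothetical helper that discards the error detail and returns None for invalid input.

```rust
/// Tries to interpret raw bytes as UTF-8 text.
/// `bytes_to_text` is a hypothetical helper; it returns None on invalid UTF-8.
fn bytes_to_text(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    let valid = vec![0x52, 0x75, 0x73, 0x74];   // "Rust" in ASCII, which is also valid UTF-8
    let invalid = vec![0x52, 0xFF, 0x73, 0x74]; // 0xFF never occurs in valid UTF-8

    println!("{:?}", bytes_to_text(valid));           // Some("Rust")
    println!("{:?}", bytes_to_text(invalid.clone())); // None

    // The lossy conversion substitutes U+FFFD for each invalid sequence.
    println!("{}", String::from_utf8_lossy(&invalid));
}
```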

18.3.3 Creating and Modifying Strings

#![allow(unused)]
fn main() {
// Create an empty String
let mut s1 = String::new();

// Create from a string literal (&str)
let s2 = String::from("initial content");
let s3 = "initial content".to_string(); // Equivalent result; both forms are common

// Appending content
let mut s = String::from("foo");
s.push_str("bar"); // Appends a &str slice. s is now "foobar"
s.push('!');       // Appends a single char. s is now "foobar!"
}

Appending uses the same reallocation strategy as Vec, giving amortized O(1) performance per append.
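When the final size is known in advance, reallocation can be avoided entirely by reserving the buffer up front with String::with_capacity(). A sketch, using a hypothetical helper `repeat_piece` (the standard library already provides str::repeat for this particular job; the loop is here only to show the reservation):

```rust
/// Builds a string of `n` copies of `piece`, reserving the full size first
/// so that the loop does not need to grow the buffer.
/// `repeat_piece` is a hypothetical helper for illustration.
fn repeat_piece(piece: &str, n: usize) -> String {
    let mut out = String::with_capacity(piece.len() * n);
    for _ in 0..n {
        out.push_str(piece);
    }
    out
}

fn main() {
    let s = repeat_piece("ab", 3);
    // Capacity is at least the requested 6 bytes; the exact value may be larger.
    println!("{} (len {}, capacity {})", s, s.len(), s.capacity());
}
```

This is comparable to a single malloc of the final size in C instead of repeated realloc calls.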

18.3.4 Concatenation

There are several ways to combine strings:

Using the + operator (implemented through the add method of the Add trait): This operation takes ownership of the left-hand String, appends to its buffer, and requires a borrowed &str on the right.

#![allow(unused)]
fn main() {
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
// s1 is moved here and can no longer be used directly.
// &s2 is a &String; it works because deref coercion converts it to &str.
let s3 = s1 + &s2;
println!("{}", s3); // Prints "Hello, world!"
// println!("{}", s1); // Compile Error: value used after move
}

Because + moves the left operand and each additional operand needs its own borrow, chaining many additions becomes verbose (s1 + &s2 + &s3 + ...), and the buffer may be reallocated several times as it grows.
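For comparison, here is a sketch of a chained + next to two alternatives: appending into one buffer with push_str, and the join/concat methods available on slices of strings. The helper name `join_with_push` is hypothetical.

```rust
/// Concatenates all parts by appending into a single buffer.
/// `join_with_push` is a hypothetical helper for illustration.
fn join_with_push(parts: &[&str]) -> String {
    let mut out = String::new();
    for part in parts {
        out.push_str(part);
    }
    out
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    // Chained +: s1 is moved, s2 and s3 are only borrowed.
    let chained = s1 + "-" + &s2 + "-" + &s3;
    println!("{}", chained); // tic-tac-toe

    // Appending piece by piece into one String.
    println!("{}", join_with_push(&["tic", "-", "tac", "-", "toe"]));

    // Slice methods from the standard library.
    println!("{}", ["tic", "tac", "toe"].join("-"));
    println!("{}", ["tic", "tac", "toe"].concat());
}
```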

Using the format! macro: This is generally the most flexible and readable approach, especially for combining multiple pieces or non-string data. It does not take ownership of its arguments (it borrows them via references) and returns a newly allocated, owned String.

#![allow(unused)]
fn main() {
let name = "Rustacean";
let level = 99;
let s1 = String::from("Status: ");
let greeting = format!("{}{}! Your level is {}.", s1, name, level);
println!("{}", greeting); // Prints "Status: Rustacean! Your level is 99."
// s1, name, and level are still usable here because format! borrowed them.
println!("{} still exists.", s1);
}

18.3.5 UTF-8, Characters, and Indexing

Because String guarantees UTF-8, where a character occupies 1 to 4 bytes, indexing with a single integer (s[i]) to get a char does not compile. This is a safety feature: a byte index might fall in the middle of a multi-byte character, and locating the i-th character requires an O(n) scan rather than the O(1) lookup that index syntax suggests.

Instead, Rust provides methods to work with strings correctly:

Iterating over Unicode scalar values (char):

#![allow(unused)]
fn main() {
let hello = String::from("Здравствуйте"); // Russian "Hello" (multi-byte chars)
for c in hello.chars() {
    print!("'{}' ", c); // Prints 'З' 'д' 'р' 'а' 'в' 'с' 'т' 'в' 'у' 'й' 'т' 'е'
}
println!("\nNumber of chars: {}", hello.chars().count()); // 12 chars
}
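Building on chars(), positional access can be written explicitly, which also makes its O(n) cost visible. The sketch below uses a hypothetical helper `nth_char` and the standard char_indices() method, which pairs each character with its starting byte offset.

```rust
/// Returns the n-th Unicode scalar value (0-based), or None if out of range.
/// This walks the string from the start, so it is O(n).
/// `nth_char` is a hypothetical helper for illustration.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // \u{e9} is 'é', which takes 2 bytes in UTF-8.
    let word = "h\u{e9}llo";

    println!("{:?}", nth_char(word, 1));  // Some('é')
    println!("{:?}", nth_char(word, 10)); // None

    // Byte offsets jump from 1 to 3 because 'é' occupies two bytes.
    for (byte_offset, c) in word.char_indices() {
        println!("byte {} -> {}", byte_offset, c);
    }
}
```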

Iterating over raw bytes (u8):

#![allow(unused)]
fn main() {
let hello = String::from("Здравствуйте");
for b in hello.bytes() {
    print!("{} ", b); // Prints the underlying UTF-8 bytes (2 bytes per char here)
}
println!("\nNumber of bytes: {}", hello.len()); // 24 bytes
}

Slicing (&s[start..end]): You can create &str slices using byte indices, but this will panic the current thread if the start or end indices do not fall exactly on UTF-8 character boundaries. Use with caution.

#![allow(unused)]
fn main() {
let s = String::from("hello");
let h = &s[0..1]; // Ok, slice is "h"

let multi_byte = String::from("नमस्ते"); // Hindi "Namaste", each char is 3 bytes
// The first char is at byte indices 0..3.
let first_char_slice = &multi_byte[0..3]; // Ok, slice is "न"
// let bad_slice = &multi_byte[0..1]; // PANIC! 1 is not on a character boundary
}
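When the indices are not known to be valid, the panic can be avoided with str::get(), which returns None for a range that is out of bounds or not on a character boundary, and with is_char_boundary(). A minimal sketch; `truncate_at_boundary` is a hypothetical helper, not a standard library function.

```rust
/// Returns at most `max_bytes` bytes of `s`, shortened to the nearest
/// earlier character boundary so the result is always valid UTF-8.
/// `truncate_at_boundary` is a hypothetical helper for illustration.
fn truncate_at_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "नमस्ते"; // each code point is 3 bytes

    println!("{:?}", s.get(0..1)); // None instead of a panic
    println!("{:?}", s.get(0..3)); // Some("न")
    println!("{:?}", truncate_at_boundary(s, 4)); // cut back to byte 3
}
```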

For operations sensitive to grapheme clusters (user-perceived characters, such as ‘e’ followed by the combining acute accent U+0301, which is two chars but one visible character), use external crates like unicode-segmentation.

18.3.6 Common `String` Methods

len() -> usize: Returns the length of the string in bytes (not characters). O(1).
is_empty() -> bool: Checks if the string has zero bytes. O(1).
contains(pattern) -> bool: Checks if the string contains a match for the pattern (a &str substring, a char, or a closure on char).
replace(from, to: &str) -> String: Returns a new String with all matches of the pattern from replaced by to.
split(pattern) -> Split: Returns an iterator over &str slices separated by a pattern (char, &str, etc.).
trim() -> &str: Returns a &str slice with leading and trailing whitespace removed.
as_str() -> &str: Borrows the String as an immutable &str slice covering the entire string. Often done implicitly via deref coercion.
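Several of these methods are often combined. The following sketch uses a hypothetical helper `parse_fields` that splits a comma-separated line with split(), trim(), and is_empty(); the returned slices borrow from the input, so no text is copied.

```rust
/// Splits a comma-separated line into trimmed, non-empty fields.
/// The returned &str slices point into `line`; nothing is allocated
/// except the Vec itself. `parse_fields` is a hypothetical helper.
fn parse_fields(line: &str) -> Vec<&str> {
    line.split(',')
        .map(|field| field.trim())
        .filter(|field| !field.is_empty())
        .collect()
}

fn main() {
    let line = String::from("  alpha, beta ,, gamma  ");

    println!("{:?}", parse_fields(&line)); // ["alpha", "beta", "gamma"]
    println!("{}", line.len());            // length in bytes, including spaces
    println!("{}", line.contains("beta")); // true
    println!("{}", line.trim().replace("alpha", "omega"));
}
```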

18.3.7 Summary: `String` vs. C Strings

Traditional C strings (char*, usually null-terminated) present several challenges that Rust’s String and &str system addresses:

Encoding Ambiguity: C strings lack inherent encoding information. They might be ASCII, Latin-1, UTF-8, or another encoding depending on context and platform. Rust’s String/&str guarantee UTF-8.
Length Calculation: Finding the length of a C string (strlen) requires scanning for the null terminator (\0), an O(n) operation. Rust’s String stores its byte length, making len() an O(1) operation. &str also includes the length as part of its fat pointer.
Memory Management: Manual allocation, resizing (malloc/realloc), and copying (strcpy/strcat) in C are common sources of buffer overflows and memory leaks. Rust’s String handles memory automatically and safely.
Mutability Risks: Modifying C strings in place requires careful buffer management to avoid overflows. String provides safe methods like push_str. &str is immutable, preventing accidental modification through slices.
Interior Null Bytes: C strings cannot contain null bytes (\0) as they signal termination. Rust Strings can contain \0 like any other valid UTF-8 character (though this is uncommon in text data).
Null Termination and FFI: Crucially, Rust Strings and &strs are not null-terminated. Passing a pointer from String::as_ptr() or a &str directly to a C function expecting a null-terminated const char* is unsafe and incorrect, as the C code might read past the end of the Rust string’s data. For safe interoperability when passing strings to C, Rust provides std::ffi::CString, which creates an owned, null-terminated byte sequence (checking for and prohibiting interior nulls). Interacting with C strings received from C typically uses std::ffi::CStr. (FFI details are covered elsewhere).
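The CString and CStr types mentioned above can be sketched without calling any real C function. The helper `to_c_string` is hypothetical; in actual FFI code the pointer from as_ptr() would be passed to an extern "C" function, which is omitted here.

```rust
use std::ffi::{CStr, CString};

/// Converts Rust text into an owned, NUL-terminated C string.
/// Returns None if the text contains an interior NUL byte.
/// `to_c_string` is a hypothetical helper for illustration.
fn to_c_string(text: &str) -> Option<CString> {
    CString::new(text).ok()
}

fn main() {
    let c_string = to_c_string("hello").expect("no interior NUL");

    // The terminating NUL byte is appended automatically.
    println!("{:?}", c_string.as_bytes_with_nul());

    // as_ptr() yields a pointer usable where C expects a const char*.
    // The CString must stay alive for as long as the pointer is used.
    let ptr = c_string.as_ptr();

    // Reading a NUL-terminated string back, as one would for a pointer from C.
    // SAFETY: ptr points to a valid NUL-terminated buffer owned by c_string,
    // which is still alive here.
    let round_trip: &CStr = unsafe { CStr::from_ptr(ptr) };
    println!("{:?}", round_trip.to_str()); // Ok("hello")

    // Interior NUL bytes are rejected instead of silently truncating the text.
    println!("{}", to_c_string("bad\0text").is_none()); // true
}
```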

String and &str provide a robust, safe, and Unicode-aware system for handling text data, significantly improving upon the limitations and unsafety of traditional C strings, while offering specific mechanisms for safe C interoperability when needed.
