18.3 The String Type
Rust’s String type represents a growable, mutable, owned sequence of UTF-8 encoded text. Like Vec<T>, the String struct itself is a small, fixed-size value (typically living on the stack) that holds three fields: a pointer to its heap-allocated text data, a length, and a capacity. Only the text data is stored on the heap; the struct manages that allocation. A call to String::new() does not allocate memory on the heap; it only creates an empty String struct. The heap allocation happens when data is first added to the string. This automatic, lazy memory management is a key advantage over manual C-style string handling.
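The lazy allocation described above can be observed through capacity(). The following is a minimal sketch; the helper name capacities_before_and_after_push is hypothetical and exists only for illustration. The exact capacity chosen after the first push is an implementation detail, so the sketch only assumes it is at least 1.

```rust
/// Returns the capacity of a fresh String and its capacity after one push.
/// (Hypothetical helper name, used only for this illustration.)
pub fn capacities_before_and_after_push() -> (usize, usize) {
    let mut s = String::new();
    let before = s.capacity(); // empty String: no heap buffer has been allocated
    s.push('a'); // the first write triggers the heap allocation
    let after = s.capacity(); // now at least 1; the exact value is unspecified
    (before, after)
}
```

Calling the helper from main and printing the pair shows a capacity of 0 before the push and a non-zero capacity afterwards.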
18.3.1 Understanding String vs. &str
This distinction is fundamental in Rust and often a point of confusion for newcomers:
- String: An owned, heap-allocated buffer containing UTF-8 text. It owns the data it holds. It is mutable (can be modified, e.g., by appending text) and responsible for freeing its memory when it goes out of scope. Think of it like a Vec<u8> specialized for UTF-8.
- &str (string slice): A borrowed, immutable view into a sequence of UTF-8 bytes. It consists of a pointer to the data and a length. It does not own the data it points to. It can refer to part of a String, an entire String, or a string literal embedded in the program’s binary.
- String literals: Expressions like "hello" in your code have the type &'static str. The 'static lifetime means the reference is valid for the entire duration of the program, because the underlying string data ("hello") is embedded directly into the program’s binary data segment and thus lives forever.
- The str type: You might wonder about str without the &. str itself is the primitive sequence type, but it’s an unsized type (Dynamically Sized Type or DST) because its length isn’t known at compile time. Because variables and function arguments must have a known size, Rust requires that we always interact with str via pointers like &str (a “fat pointer” containing address and length) or Box<str> (an owned pointer). &str is the ubiquitous borrowed form.
You can get an immutable &str slice from a String easily (e.g., &my_string[..], or often implicitly via deref coercion), but converting a &str to an owned String usually involves allocating memory and copying the data (e.g., using .to_string() or String::from()).
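The conversions in both directions can be sketched as follows. The function names shout and conversions are hypothetical; they only serve to show that a function taking &str accepts both an explicit slice and a &String (through deref coercion), and that going from &str back to String makes a copy.

```rust
/// Accepts any borrowed string slice. (Hypothetical helper for illustration.)
pub fn shout(text: &str) -> String {
    text.to_uppercase()
}

/// Demonstrates String -> &str (cheap borrow) and &str -> String (allocating copy).
pub fn conversions() -> (String, String, String) {
    let owned: String = String::from("hello");

    let whole: &str = &owned[..]; // explicit slice over the entire String
    let a = shout(whole);
    let b = shout(&owned); // &String is coerced to &str automatically

    let literal: &'static str = "hello";
    let copied: String = literal.to_string(); // allocates and copies the bytes

    (a, b, copied)
}
```

Writing function parameters as &str rather than &String is the usual choice, since it accepts literals, slices, and borrowed Strings alike.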
18.3.2 String vs. Vec<u8>
While a String is internally backed by a buffer of bytes (like Vec<u8>), its defining difference is the UTF-8 guarantee: every safe String method keeps the byte sequence valid UTF-8. If you need to handle arbitrary binary data, raw byte streams, or text in an encoding other than UTF-8, use Vec<u8> instead. Creating a String from bytes that are not valid UTF-8 fails: String::from_utf8 returns an Err (which panics only if you unwrap it), while String::from_utf8_lossy substitutes the replacement character U+FFFD for the invalid sequences.
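A short sketch of the two standard conversion paths from bytes to text. The wrapper names decode and decode_lossy are hypothetical; the calls they wrap, String::from_utf8 and String::from_utf8_lossy, are part of the standard library.

```rust
use std::string::FromUtf8Error;

/// Strict conversion: succeeds only if the bytes are valid UTF-8.
/// (Hypothetical wrapper around String::from_utf8.)
pub fn decode(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Lossy conversion: invalid sequences become U+FFFD instead of an error.
/// (Hypothetical wrapper around String::from_utf8_lossy.)
pub fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

The strict form is the safer default when invalid input should be reported; the lossy form suits logging or display, where dropping information is acceptable.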
18.3.3 Creating and Modifying Strings
    fn main() {
        // Create an empty String (no heap allocation yet)
        let mut s1 = String::new();
        s1.push_str("now allocated");

        // Create from a string literal (&str)
        let s2 = String::from("initial content");
        let s3 = "initial content".to_string(); // Equivalent to String::from
        assert_eq!(s2, s3);

        // Appending content
        let mut s = String::from("foo");
        s.push_str("bar"); // Appends a &str slice. s is now "foobar"
        s.push('!'); // Appends a single char. s is now "foobar!"
        println!("{}, {}", s1, s);
    }
Appending uses the same growth strategy as Vec: when the buffer is full it is reallocated with extra capacity, so appends cost amortized O(1) per byte added.
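When the final size is known in advance, reallocation can be avoided altogether by reserving the buffer once. This is a sketch under that assumption; build_without_reallocating is a hypothetical helper, while String::with_capacity and capacity() are standard library methods.

```rust
/// Concatenates `parts` into one String after reserving the exact space needed.
/// Returns the result and whether the capacity stayed unchanged while appending.
/// (Hypothetical helper for illustration.)
pub fn build_without_reallocating(parts: &[&str]) -> (String, bool) {
    // Total number of bytes that will be appended.
    let total: usize = parts.iter().map(|p| p.len()).sum();

    let mut out = String::with_capacity(total);
    let initial_capacity = out.capacity();

    for p in parts {
        out.push_str(p); // fits in the reserved buffer, so no growth is needed
    }

    let unchanged = out.capacity() == initial_capacity;
    (out, unchanged)
}
```

Reserving up front is an optimization rather than a requirement; without it the same loop still works and simply grows the buffer as needed.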
18.3.4 Concatenation
There are several ways to combine strings:
- Using the + operator (via the add method of the Add trait): This operation takes ownership of the left-hand String and requires a borrowed &str on the right.

      fn main() {
          let s1 = String::from("Hello, ");
          let s2 = String::from("world!");
          // s1 is moved here and can no longer be used directly.
          // &s2 works because &String deref-coerces to &str.
          let s3 = s1 + &s2;
          println!("{}", s3); // Prints "Hello, world!"
          // println!("{}", s1); // Compile Error: value used after move
      }

  Because + moves the left operand, chaining multiple additions (s1 + &s2 + &s3 + ...) quickly becomes verbose, and each step may reallocate the buffer.

- Using the format! macro: This is generally the most flexible and readable approach, especially for combining multiple pieces or non-string data. It does not take ownership of its arguments (it borrows them via references) and returns a newly allocated, owned String.

      fn main() {
          let name = "Rustacean";
          let level = 99;
          let s1 = String::from("Status: ");
          let greeting = format!("{}{}! Your level is {}.", s1, name, level);
          println!("{}", greeting); // Prints "Status: Rustacean! Your level is 99."
          // s1, name, and level are still usable here because format! borrowed them.
          println!("{} still exists.", s1);
      }
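The two approaches can be put side by side for three pieces of text. This is an illustrative sketch; join_with_plus and join_with_format are hypothetical names. Note how the signatures differ: the + version must take its first argument by value, while the format! version only borrows.

```rust
/// Chained `+`: consumes `a` and appends the borrowed pieces to it.
/// (Hypothetical helper for illustration.)
pub fn join_with_plus(a: String, b: &str, c: &str) -> String {
    a + b + c
}

/// `format!`: borrows every argument and builds a fresh String.
/// (Hypothetical helper for illustration.)
pub fn join_with_format(a: &str, b: &str, c: &str) -> String {
    format!("{}{}{}", a, b, c)
}
```

Both produce the same text; the choice mostly comes down to whether the left operand is still needed afterwards and how many pieces are involved.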
18.3.5 UTF-8, Characters, and Indexing
Because String guarantees UTF-8, where characters can span multiple bytes (1 to 4), direct indexing by byte position (s[i]) to get a char is disallowed. This is a safety feature: a byte index might fall in the middle of a multi-byte character, leading to an invalid character boundary.
Instead, Rust provides methods to work with strings correctly:
- Iterating over Unicode scalar values (char):

      fn main() {
          let hello = String::from("Здравствуйте"); // Russian "Hello" (multi-byte chars)
          for c in hello.chars() {
              print!("'{}' ", c); // Prints 'З' 'д' 'р' 'а' 'в' 'с' 'т' 'в' 'у' 'й' 'т' 'е'
          }
          println!("\nNumber of chars: {}", hello.chars().count()); // 12 chars
      }

- Iterating over raw bytes (u8):

      fn main() {
          let hello = String::from("Здравствуйте");
          for b in hello.bytes() {
              print!("{} ", b); // Prints the underlying UTF-8 bytes (2 bytes per char here)
          }
          println!("\nNumber of bytes: {}", hello.len()); // 24 bytes
      }

- Slicing (&s[start..end]): You can create &str slices using byte indices, but this will panic the current thread if the start or end indices do not fall exactly on UTF-8 character boundaries. Use with caution.

      fn main() {
          let s = String::from("hello");
          let h = &s[0..1]; // Ok, slice is "h"
          let multi_byte = String::from("नमस्ते"); // Hindi "Namaste", each char is 3 bytes
          // The first char is at byte indices 0..3.
          let first_char_slice = &multi_byte[0..3]; // Ok, slice is "न"
          // let bad_slice = &multi_byte[0..1]; // PANIC! 1 is not on a character boundary
          println!("{} {}", h, first_char_slice);
      }

For operations sensitive to grapheme clusters (user-perceived characters, like ‘e’ + combining accent ‘´’), use external crates like unicode-segmentation.
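Slicing can also be done without risking a panic. The sketch below shows two options from the standard library: str::get, which returns None instead of panicking on a bad boundary, and char_indices, which yields the byte offset of each char. The helper names try_slice and first_n_chars are hypothetical.

```rust
/// Non-panicking slice: None if either index is out of range
/// or not on a char boundary. (Hypothetical helper.)
pub fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Returns the prefix containing the first `n` chars (Unicode scalar values),
/// or the whole string if it has fewer than `n`. (Hypothetical helper.)
pub fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // byte_idx is where the (n+1)-th char starts, always a valid boundary
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}
```

Note that first_n_chars counts Unicode scalar values, not grapheme clusters, so a prefix may still separate a base character from a following combining mark.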
18.3.6 Common String Methods
- len() -> usize: Returns the length of the string in bytes (not characters). O(1).
- is_empty() -> bool: Checks if the string has zero bytes. O(1).
- contains(pattern: &str) -> bool: Checks if the string contains a given substring.
- replace(from: &str, to: &str) -> String: Returns a new String with all occurrences of from replaced by to.
- split(pattern) -> Split: Returns an iterator over &str slices separated by a pattern (char, &str, etc.).
- trim() -> &str: Returns a &str slice with leading and trailing whitespace removed.
- as_str() -> &str: Borrows the String as an immutable &str slice covering the entire string. Often done implicitly via deref coercion.
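Several of these methods are commonly chained. The following sketch, with the hypothetical helpers normalize and byte_and_char_len, combines trim, replace, split, and is_empty to clean up a delimited line, and contrasts len() (bytes) with chars().count().

```rust
/// Splits a line on ',' or ';', trims each field, and drops empty fields.
/// (Hypothetical helper for illustration.)
pub fn normalize(line: &str) -> Vec<String> {
    line.trim()
        .replace(";", ",") // returns a new String with ';' turned into ','
        .split(',') // iterator over &str pieces
        .map(|field| field.trim().to_string())
        .filter(|field| !field.is_empty())
        .collect()
}

/// Returns (length in bytes, number of chars) for `s`. (Hypothetical helper.)
pub fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For a string such as "h\u{e9}llo" the two lengths differ, because the accented letter occupies two bytes but counts as one char.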
18.3.7 Summary: String vs. C Strings
Traditional C strings (char*, usually null-terminated) present several challenges that Rust’s String and &str system addresses:
- Encoding Ambiguity: C strings lack inherent encoding information. They might be ASCII, Latin-1, UTF-8, or another encoding depending on context and platform. Rust’s String/&str guarantee UTF-8.
- Length Calculation: Finding the length of a C string (strlen) requires scanning for the null terminator (\0), an O(n) operation. Rust’s String stores its byte length, making len() an O(1) operation. &str also includes the length as part of its fat pointer.
- Memory Management: Manual allocation, resizing (malloc/realloc), and copying (strcpy/strcat) in C are common sources of buffer overflows and memory leaks. Rust’s String handles memory automatically and safely.
- Mutability Risks: Modifying C strings in place requires careful buffer management to avoid overflows. String provides safe methods like push_str. &str is immutable, preventing accidental modification through slices.
- Interior Null Bytes: C strings cannot contain null bytes (\0) as they signal termination. Rust Strings can contain \0 like any other valid UTF-8 character (though this is uncommon in text data).
- Null Termination and FFI: Crucially, Rust Strings and &strs are not null-terminated. Passing a pointer from String::as_ptr() or a &str directly to a C function expecting a null-terminated const char* is unsafe and incorrect, as the C code might read past the end of the Rust string’s data. For safe interoperability when passing strings to C, Rust provides std::ffi::CString, which creates an owned, null-terminated byte sequence (checking for and prohibiting interior nulls). Interacting with C strings received from C typically uses std::ffi::CStr. (FFI details are covered elsewhere).
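The last point can be illustrated without calling any C code. In this sketch the helper names to_c_bytes and from_c_bytes are hypothetical; CString::new, as_bytes_with_nul, CStr::from_bytes_with_nul, and to_str are standard library calls. A real FFI call would pass CString::as_ptr() to the C function instead of copying the bytes out.

```rust
use std::ffi::{CStr, CString};

/// Converts Rust text to a null-terminated byte sequence.
/// Returns None if `s` contains an interior \0. (Hypothetical helper.)
pub fn to_c_bytes(s: &str) -> Option<Vec<u8>> {
    CString::new(s)
        .ok()
        .map(|c| c.as_bytes_with_nul().to_vec())
}

/// Interprets null-terminated bytes as UTF-8 text.
/// Returns None if the terminator is missing or misplaced,
/// or if the bytes are not valid UTF-8. (Hypothetical helper.)
pub fn from_c_bytes(bytes: &[u8]) -> Option<String> {
    let c = CStr::from_bytes_with_nul(bytes).ok()?;
    c.to_str().ok().map(str::to_owned)
}
```

The extra terminator byte appended by CString is exactly what a plain String or &str lacks, which is why passing their pointers directly to C is incorrect.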
String and &str provide a robust, safe, and Unicode-aware system for handling text data, significantly improving upon the limitations and unsafety of traditional C strings, while offering specific mechanisms for safe C interoperability when needed.