18.3 The String Type

Rust’s String type represents a growable, mutable, owned sequence of UTF-8 encoded text. It is stored on the heap and automatically manages its memory, conceptually similar to Vec<u8> but specifically designed for string data with the critical guarantee that its contents are always valid UTF-8.

18.3.1 Understanding String vs. &str

This distinction is fundamental in Rust and often a point of confusion for newcomers:

  • String: An owned, heap-allocated buffer containing UTF-8 text. It owns the data it holds. It can be modified (e.g., by appending text) when bound with mut, and it frees its memory when it goes out of scope. Think of it like a Vec<u8> specialized for UTF-8.
  • &str (string slice): A borrowed, immutable view (a pointer and length) into a sequence of UTF-8 bytes. It does not own the data it points to. It can refer to part of a String, an entire String, or a string literal embedded in the program’s binary. String literals (e.g., "hello") have the type &'static str, meaning they are borrowed for the entire program’s lifetime. Think of &str like a &[u8] (slice of bytes) that is guaranteed to be valid UTF-8.

You can get an immutable &str slice from a String easily (e.g., &my_string[..], or often implicitly via deref coercion), but converting a &str to an owned String usually involves allocating memory and copying the data (e.g., using .to_string() or String::from()).
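The borrowing and copying directions described above can be sketched as follows. The function names shout and first_word are hypothetical helpers invented for this illustration; they are not part of the standard library.

```rust
// Takes a borrowed &str, so callers can pass a string literal or a &String
// (the latter is converted automatically by deref coercion).
fn shout(text: &str) -> String {
    // to_uppercase allocates and returns a new owned String.
    text.to_uppercase()
}

// Returns a borrowed view into the input; no bytes are copied.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}
```

Calling shout(&owned) with an owned String only borrows it, while first_word(&owned).to_string() is the step that allocates and copies the borrowed slice into a new String.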

18.3.2 String vs. Vec<u8>

While a String is internally backed by a buffer of bytes (like Vec<u8>), its primary difference is the UTF-8 guarantee. String methods ensure that the byte sequence remains valid UTF-8. If you need to handle arbitrary binary data, raw byte streams, or text in an encoding other than UTF-8, you should use Vec<u8> instead. Creating a String from invalid UTF-8 fails: String::from_utf8 returns an Err, which panics only if you call unwrap() or expect() on it, while String::from_utf8_lossy replaces each invalid sequence with the replacement character U+FFFD.
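A minimal sketch of converting bytes to a String, using the standard String::from_utf8 and String::from_utf8_lossy functions. The wrapper names decode_strict and decode_lossy, and the error message format, are assumptions made for this example.

```rust
// Strict conversion: succeeds only if the bytes are valid UTF-8.
// The error is turned into a message here purely for illustration.
fn decode_strict(bytes: Vec<u8>) -> Result<String, String> {
    String::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {}", e))
}

// Lossy conversion: invalid sequences become U+FFFD (the replacement character).
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, decode_strict(vec![0xff, 0xfe]) yields an Err rather than a panic, because 0xff can never appear in valid UTF-8.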

18.3.3 Creating and Modifying Strings

#![allow(unused)]
fn main() {
// Create an empty String
let mut s1 = String::new();

// Create from a string literal (&str)
let s2 = String::from("initial content");
let s3 = "initial content".to_string(); // Equivalent; both forms are idiomatic

// Appending content
let mut s = String::from("foo");
s.push_str("bar"); // Appends a &str slice. s is now "foobar"
s.push('!');       // Appends a single char. s is now "foobar!"
}

Appending uses the same growth strategy as Vec: when the buffer is full, it is reallocated with extra capacity, so appends run in amortized O(1) time.
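The effect of capacity on appends can be observed with a small sketch. Both function names below are hypothetical, and the exact growth factor is an implementation detail of the standard library, so the example only counts capacity changes instead of asserting specific capacities.

```rust
// Builds `n` copies of `piece`, reserving the full size up front so that
// the loop itself never needs to reallocate.
fn repeat_piece(piece: &str, n: usize) -> String {
    let mut out = String::with_capacity(piece.len() * n);
    for _ in 0..n {
        out.push_str(piece);
    }
    out
}

// Counts how often the capacity changes while pushing `n` one-byte chars
// onto an initially empty String.
fn count_capacity_changes(n: usize) -> usize {
    let mut s = String::new();
    let mut last = s.capacity();
    let mut changes = 0;
    for _ in 0..n {
        s.push('x');
        if s.capacity() != last {
            changes += 1;
            last = s.capacity();
        }
    }
    changes
}
```

The number of capacity changes grows far more slowly than the number of pushes, which is what makes appending cheap on average. When the final size is known in advance, String::with_capacity avoids the intermediate reallocations entirely.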

18.3.4 Concatenation

There are several ways to combine strings:

  1. Using the + operator (via the add method of the Add trait): This operation takes ownership of the left-hand String and requires a borrowed &str on the right.

    #![allow(unused)]
    fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");
    // s1 is moved here and can no longer be used directly.
    // &s2 works because &String coerces to &str (deref coercion).
    let s3 = s1 + &s2;
    println!("{}", s3); // Prints "Hello, world!"
    // println!("{}", s1); // Compile Error: value used after move
    }

    Because + moves the left operand and every right-hand operand must be borrowed, chaining multiple additions quickly becomes verbose (s1 + &s2 + &s3 + ...), and each step may reallocate the buffer as it grows.

  2. Using the format! macro: This is generally the most flexible and readable approach, especially for combining multiple pieces or non-string data. It does not take ownership of its arguments (it takes references).

    #![allow(unused)]
    fn main() {
    let name = "Rustacean";
    let level = 99;
    let s1 = String::from("Status: ");
    let greeting = format!("{}{}! Your level is {}.", s1, name, level);
    println!("{}", greeting); // Prints "Status: Rustacean! Your level is 99."
    // s1, name, and level are still usable here.
    println!("{} still exists.", s1);
    }
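The two approaches above can be compared side by side in a short sketch. The function names are hypothetical. The third function uses the standard slice method join, which the list above does not cover; it is included only as an additional option when the pieces are already in a slice.

```rust
// Chained +: the first operand must be an owned String, the rest are &str.
fn join_with_plus(a: String, b: &str, c: &str) -> String {
    a + b + c
}

// format!: only borrows its arguments and builds a new String.
fn join_with_format(a: &str, b: &str, c: &str) -> String {
    format!("{}{}{}", a, b, c)
}

// Slice join: concatenates all parts with a separator between them.
fn join_parts(parts: &[&str], sep: &str) -> String {
    parts.join(sep)
}
```

Note that join_with_plus consumes its first argument, so a caller that still needs the original String must clone it first, whereas the other two functions leave all inputs usable.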

18.3.5 UTF-8, Characters, and Indexing

Because String guarantees UTF-8, where a character occupies 1 to 4 bytes, indexing with a single integer (s[i]) to get a char is not supported: String does not implement Index<usize>, so such code does not compile. A byte index might fall in the middle of a multi-byte character, and finding the i-th character requires scanning the string from the start, so it cannot be an O(1) operation.

Instead, Rust provides methods to work with strings correctly:

  • Iterating over Unicode scalar values (char):
    #![allow(unused)]
    fn main() {
    let hello = String::from("Здравствуйте"); // Russian "Hello" (multi-byte chars)
    for c in hello.chars() {
        print!("'{}' ", c); // Prints 'З' 'д' 'р' 'а' 'в' 'с' 'т' 'в' 'у' 'й' 'т' 'е'
    }
    println!("\nNumber of chars: {}", hello.chars().count()); // 12 chars
    }
  • Iterating over raw bytes (u8):
    #![allow(unused)]
    fn main() {
    let hello = String::from("Здравствуйте"); // Same string as in the previous example
    for b in hello.bytes() {
        print!("{} ", b); // Prints the underlying UTF-8 bytes (2 bytes per char here)
    }
    println!("\nNumber of bytes: {}", hello.len()); // 24 bytes
    }
  • Slicing (&s[start..end]): You can create &str slices using byte indices, but this will panic if the start or end indices do not fall exactly on UTF-8 character boundaries. Use with caution.
    #![allow(unused)]
    fn main() {
    let s = String::from("hello");
    let h = &s[0..1]; // Ok, slice is "h"
    
    let multi_byte = String::from("नमस्ते"); // Hindi "Namaste"
    let first_char_slice = &multi_byte[0..3]; // Ok, first char "न" is 3 bytes
    // let bad_slice = &multi_byte[0..1]; // PANIC! 1 is not on a char boundary
    }
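Slicing panics can be avoided with the standard methods char_indices and get, as in the following sketch. The helper names take_chars and try_slice are hypothetical and exist only for this illustration.

```rust
// Returns a slice holding the first `n` chars of `s`, or all of `s` if it is
// shorter. char_indices yields the byte offset at which each char starts, so
// the cut always lands on a character boundary.
fn take_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

// Non-panicking slicing: get returns None when the range is out of bounds
// or does not fall on char boundaries.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

With the Hindi string from the example above, try_slice(s, 0, 1) returns None instead of panicking, and take_chars(s, 1) returns the 3-byte slice "न". The method is_char_boundary can be used to test a single byte index before slicing.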

For operations sensitive to grapheme clusters (user-perceived characters, like ‘e’ + combining accent ‘´’), use external crates like unicode-segmentation.

18.3.6 Common String Methods

  • len() -> usize: Returns the length of the string in bytes (not characters). O(1).
  • is_empty() -> bool: Checks if the string has zero bytes. O(1).
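  • contains(pattern) -> bool: Checks if the string contains a match for the pattern, which can be a &str substring, a char, or a closure over char.
  • replace(from, to: &str) -> String: Returns a new String with all matches of the pattern from replaced by to. The original string is not modified.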
  • split(pattern) -> Split: Returns an iterator over &str slices separated by a pattern (char, &str, etc.).
  • trim() -> &str: Returns a &str slice with leading and trailing whitespace removed.
  • as_str() -> &str: Borrows the String as an immutable &str slice covering the entire string. Often done implicitly via deref coercion.
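Several of the methods listed above can be combined as in this sketch. The function names and the input format (comma-separated fields with underscores standing in for spaces) are assumptions chosen purely for illustration.

```rust
// Splits a line on commas, trims whitespace around each field, and replaces
// underscores with spaces. Each field becomes a newly allocated String.
fn normalize_fields(line: &str) -> Vec<String> {
    line.trim()
        .split(',')
        .map(|field| field.trim().replace('_', " "))
        .collect()
}

// Reports the byte length, emptiness, and whether "rust" occurs in the text.
fn describe(text: &str) -> (usize, bool, bool) {
    (text.len(), text.is_empty(), text.contains("rust"))
}
```

Both functions take &str, so they accept string literals as well as borrowed String values. Note that trim and split return borrowed slices into the input, while replace allocates a new String.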

18.3.7 Summary: String vs. C Strings

Traditional C strings (char*, usually null-terminated) present several challenges that Rust’s String and &str system addresses:

  • Encoding Ambiguity: C strings lack inherent encoding information. They might be ASCII, Latin-1, UTF-8, or another encoding depending on context and platform. Rust’s String/&str guarantee UTF-8.
  • Length Calculation: Finding the length of a C string (strlen) requires scanning for the null terminator (\0), an O(n) operation. Rust’s String stores its byte length, making len() an O(1) operation. &str also includes the length.
  • Memory Management: Manual allocation, resizing (malloc/realloc), and copying (strcpy/strcat) in C are common sources of buffer overflows and memory leaks. Rust’s String handles memory automatically and safely.
  • Mutability Risks: Modifying C strings in place requires careful buffer management to avoid overflows. String provides safe methods like push_str. &str is immutable, preventing accidental modification through slices.
  • Interior Null Bytes: C strings cannot contain null bytes (\0) as they signal termination. Rust Strings can contain \0 like any other valid UTF-8 character (though this is uncommon in text data).
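The interior null byte difference matters when passing text to C code. The sketch below uses the standard std::ffi::CString type, which this section does not otherwise cover; the helper names are hypothetical.

```rust
use std::ffi::CString;

// Converts Rust text into a null-terminated C string. CString::new rejects
// input that contains an interior \0, so this returns None in that case.
fn to_c_string(text: &str) -> Option<CString> {
    CString::new(text).ok()
}

// Returns the bytes C code would see, including the trailing terminator.
fn c_bytes(text: &str) -> Option<Vec<u8>> {
    to_c_string(text).map(|c| c.as_bytes_with_nul().to_vec())
}
```

A Rust string such as "a\0b" has a length of 3 bytes and is perfectly valid, but to_c_string("a\0b") returns None because a C string could not represent it without being cut off at the first null byte.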

String and &str provide a robust, safe, and Unicode-aware system for handling text data, significantly improving upon the limitations and unsafety of traditional C strings.