10

RFC: Reading into uninitialized buffers by sfackler · Pull Request #2930 · rust-...

 3 years ago
source link: https://github.com/rust-lang/rfcs/pull/2930
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Member

sfackler commented on May 19

edited by scottmcm

Member

shepmaster left a comment

Has any exploration been done to see how implementing this in the standard library would be able to replace existing similar usages such as those mentioned as prior art?

A big point of the standard library is to have reusable abstractions, so it would be nice to have some confidence that it is an abstraction that would work outside of just the usage of Read.

text/0000-read-buf.md

Outdated

Show resolved

Member

Author

sfackler commented on May 19

Yeah, this should be applicable to AsyncRead without any real issue.

The [`tokio::io::AsyncRead`](https://docs.rs/tokio/0.2.21/tokio/io/trait.AsyncRead.html) trait has a somewhat similar

approach, with a `prepare_uninitialized_buffer` method which takes a `&mut [MaybeUninit<u8>]` slice and initializes it

if necessary.

shepmaster on May 19

Member

It might be nice to explicitly state that it's believed that this RFC could replace this usage (even if the crate itself decided not to for whatever reason).

pub struct ReadBuf<'a> {

buf: &'a mut [MaybeUninit<u8>],

written: usize,

initialized: usize,

}

Comment on lines

253 to 257

shepmaster on May 19

Member

At the risk of opening a can of worms, this feels very similar to arrayvec. This makes me wonder a few points:

  1. Could this be made generic over u8, so it could be used for other similar "partial initialization" cases?

  2. Could there be an owning variant of this so that we could do something like

    let mut buf = ReadBufOwned::<const 8192>();

    I'm using unstable const generics here, sadly, but I think the ergonomics are nice.

Because of #2, I'd also like to bikeshed about the name ReadBuf a little. We have Path and PathBuf, where *Buf is the owned version, so I originally wanted to type ReadBufBuf...

sfackler on May 19

Author

Member

1. Could this be made generic over u8, so it could be used for other similar "partial initialization" cases?

It could be made generic, but I'm not sure the equivalence is really tight enough. The API of ReadBuf is centered entirely around progressive initialization, which doesn't necessarily align with the API for a general purpose vec-style data structure.

2. Could there be an owning variant of this so that we could do something like

That's covered briefly in the future possibilities section.

shepmaster on May 19

Member

That's covered briefly in the future possibilities section.

Are you referring to this section?

There is probably some kind of abstraction that could be defined to encapsulate that logic.

If so, perhaps that section could be extended a bit; I've re-read it a few times and still can't see a connection to an owning version.

golddranks on May 19

edited

I, too, saw some avenues for possible generalizations: currently one is unable to use Vec in no_std crates. It's often preferable for unopinionated libraries to let the caller to decide the "backing store" of a buffer. Slices fit the bill for some cases, but in cases where there is input-dependent or otherwise unknown amount of data stored in the buffer, it's generally desirable to have a type that keeps track the "written part" and "uninitialized part", for both ergonomics and correctness (less book-keeping for the user) and performance (skipping initialization). In this sense, Vec is pretty awesome, but it comes at the cost of an allocation and dependence on allocator.

It's also desirable to have such type as a part of the "common vocabulary" of the ecosystem, but alas, std doesn't have a type that fits the bill at the moment.

Thus, I think it would be valuable to provide a type that is generalized over the contained type, and keeps track the "written"/"uninitialized" split. We could then only allow Reading, but Writeing to it, and consider it as a generic buffer type for some other uses, such as a replacement for Vec for libraries that need to work in no_std contexts.

Shnatsel on May 28

A Vec-like abstraction that uses a fixed-size slice of uninitialized memory as backing storage would indeed be a useful abstraction, but requires basically reimplementing the entire Vec if done outside the standard library. I've written at length about this issue here. That writeup may be a bit outdated, but I'm glad to elaborate if you believe this is relevant to this discussion.

One related thing to think about is how well this proposal would mesh with async reads where the kernel is the buffer owner, not the caller. See https://vorner.github.io/2019/11/03/io-uring-mental-experiments.html for some reasoning.

Member

Author

sfackler commented on May 20

I believe this is basically orthogonal to an io_uring interface.

burdges commented on May 20

edited

Is there an existing crate to develop ReadBuf and friends outside std for people who want to use this right now? I'm thinking like other "std v2" projects like futures before and concurrently no_std Error?

Member

joshtriplett commented on May 20

I think it'd be valuable to specify a feature (even if for internal use only) that allows a trait to specify a list of lists of methods that constitute a minimum implementation, such that an impl that doesn't provide at least one of the minimum implementations will get a warning.

Member

Author

sfackler commented on May 20

@burdges I have some of the code in https://github.com/sfackler/read-buf, but it currently uses an older version of ReadBuf that doesn't track the written cursor internally.

monomorphization or specialization.

4. It needs to work with both normal and vectored IO (via `read_vectored`).

5. It needs to be composable. Readers are very commonly nested (e.g. `GzipReader<TlsStream<TcpStream>>`), and wrapper

readers should be able to opt-in to fast paths supported by their inner reader.

abonander on May 20

edited

Every proposal for reading into uninitialized memory that revolved around adding a new method to the Read trait has been criticized for potentially leaving performance on the table if a wrapper type forgets to forward the new method.

I've never seen it seriously discussed to have a warn-by-default lint for trait methods that are meant to be overridden. This could apply to Iterator::size_hint() as well, which also leaves performance on the table if not forwarded.

abonander on May 20

I'd gladly take the task of implementing such a lint myself, in fact.

withoutboats on May 22

Contributor

In my opinion, "potentially leaving performance on the table" is fine - and preferable to overwhelming users new to these interfaces with all of the information they need to implement the absolute perfect interface in every case. We need to worry about information overload and ensure we gradually onboard users to exploiting the full power of these interfaces.

Any lints like this should be opt-in.

rpjohnst commented on May 20

Regarding the unresolved question of whether to include the written field... what sort of performance impact would we expect there, from the two choices?

With the written field included in ReadBuf, there can be fewer asserts, but the struct itself is larger, on top of becoming a by-reference argument where there once were direct by-value argument and return value. These are probably both mitigated by #[inline], but then there's the question of the impact on compile times.

Without the written field, the struct is smaller, but there are more asserts checking impl correctness. How easy is it to optimize out those asserts? Are there idioms people will learn similar to those that make bounds checks more optimizable?

Or is this difference lost in the noise because we're doing I/O? Are there cases where this might not be true?

Contributor

djc commented on May 20

It might be nice if there was a safe interface for writing to the uninitialized part of a ReadBuf from an already existing slice of buffered data (that is, is there a useful intersection of "safe interface" and "no need to unnecessarily initialize").

I did not read the Dropbox Paper document now (and I think I would feel slightly better if the RFC didn't rely on linking to it). Does it cover instead having a ReadBuf trait, with the Read impl taking that type as an argument (with default)? This could potentially allow some abstraction over things that could be wrapped, as some others have mentioned upthread.

It might be nice if there was a safe interface for writing to the uninitialized part of a ReadBuf from an already existing slice of buffered data (that is, is there a useful intersection of "safe interface" and "no need to unnecessarily initialize").

Seems like you want ReadBuf::append, which is already included.

///

/// The caller must ensure that `n` unwritten bytes of the buffer have already been initialized.

#[inline]

pub unsafe fn assume_initialized(&mut self, n: usize) { ... }

RalfJung on May 20

Member

assume_init would be more consistent with MaybeUninit::assume_init. But I guess at that point all "initialized" functions here should be called "init" instead. Maybe this is worth putting down as an open question to be decided later?

/// buffer that has been logically written to, a region that has been initialized at some point but not yet logically

/// written to, and a region at the end that is fully uninitialized. The written-to region is guaranteed to be a

/// subset of the initialized region.

pub struct ReadBuf<'a> {

RalfJung on May 20

Member

Maybe explicitly state that written <= initialized <= buf.len() is an invariant of this type?

sfackler on May 21

Author

Member

The "the written-to region is guaranteed to be a subset of the initialized region" bit is trying to say that. Maybe it should be clarified?

RalfJung on May 21

Member

I just feel like this is worth repeating in math/Rust notation -- kind of the ultimate clarification. ;)

pickfire on May 22

Contributor

It took me a while to understand that, having the mathematical notation would make it easier to understand.

///

/// Panics if `self.remaining()` is less than `n`.

#[inline]

pub fn initialize_unwritten_to(&mut self, n: usize) -> &mut [u8] { ... }

RalfJung on May 20

Member

To me, the "to" here sounds like it defines the target of the operation, like "write X to Y". Also the relation with "up to index n" is a bit weak since this is not index n in the buffer, it is index n in the uninitialized part of the buffer (I think).

Maybe initialize_unwritten_n or initialize_unwritten_len or initialize_unwritten_with_len or so?

RalfJung on May 20

edited

Member

Maybe a better set of names for the three methods that give access to the unwritten part would be unwritten, unwritten_len (both of which return &mut [u8]), and unwritten_maybe_uninit (returning &mut [MaybeUninit<u8>]).

sfackler on May 21

Author

Member

I do think it's somewhat important for the names of these methods to indicate that they're potentially doing significant amounts of work.

RalfJung on May 21

Member

Hm, that's fair. OTOH, having the most dangerous method have the shortest name (unwritten_mut) also doesn't seem great.

Contributor

Amanieu commented on May 21

Have you considered renaming written to filled? I found written a bit confusing at first glance since this type is used when reading data from a file, not writing.

Member

Author

sfackler commented on May 21

Ooh yeah, filled is much cleaner than written! I will update with that and @RalfJung's method name suggestions tonight.

rfcbot commented on Sep 8

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

Member

Author

sfackler commented on Sep 8

edited

While preparing the PR to implement the new APIs in std, I came across a bit of a footgun in the current design: a malicious Read implementation can replace the ReadBuf passed to it with a different one. If the code using the reader is just looking at the number of bytes filled/initialized in the ReadBuf (e.g. the implementation of read_to_end in the RFC text), this can cause soundness problems as the initialized count no longer refers to the buffer that the code assumes it is:

impl Read for BadReader {
    fn read_buf(&mut self, buf: &mut ReadBuf<'_>) -> io::Result<()> {
        *buf = ReadBuf::new(Box::leak(vec![0; buf.capacity()].into_boxed_slice()));
        buf.add_filled(buf.remaining());
        Ok(())
    }
}

To the best of my knowledge this can only be done by leaking a Vec<u8> like in the example above, so this should only be a problem with actively malicious implementations. It can be worked around in the calling code by checking that the buffer address after the call to read_buf matches the expected value, but that's IMO a pretty sketchy thing to have to remember to do.

I think the fix is to prevent the Read implementations from replacing the buffer. @Amanieu suggested using Pin, but that feels like a bit of an abuse of the Pin interface - ReadBuf is always movable after all! We just need to prevent replacement in this one place. The alternative is to essentially wrap the mutable reference in another type which is passed by value into read_buf:

pub struct ReadBuf<'a> { ... }

impl<'a> ReadBuf<'a> {
    pub fn as_ref(&mut self) -> ReadBufRef<'_> { ... }
}

pub struct ReadBufRef<'a> { ... }

pub trait Read {
    fn read_buf(&mut self, buf: ReadBufRef<'_>) -> io::Result<()> { ... }
}

This unfortunately results in two types that will have very similar APIs. I think I'm going to land the initial PR with the API proposed by the RFC text, but this issue will need to be fixed before stabilization.

Contributor

Diggsey commented on Sep 8

edited

a malicious Read implementation can replace the ReadBuf passed to it with a different one.

Doesn't the lifetime parameter make this difficult? Could it be solved by making it invariant so you can't just coerce a ReadBuf with a static lifeitme?

Member

RalfJung commented on Sep 8

edited

Difficult but not impossible if you have a 'static buffer, like via Box::leak.

Member

Author

sfackler commented on Sep 8

It does not - Box::leak will produce a reference with any lifetime.

Thomasdezeeuw commented on Oct 15

edited

I would like to propose a different API based on a trait rather than a new type. It allows &mut [u8] to be used in the same way it is now, but allows easier use of Vec<u8> and any custom buffer types.

The trait is following.

/// Trait to make easier to work with slices of (uninitialised) bytes.
///
/// This is implemented for common types such as `&mut[u8]` and `Vec<u8>`.
pub trait Bytes {
    /// Returns itself as a slice of bytes that may or may not be initialised.
    /// 
    /// # Unsafety
    /// 
    /// Caller must ensure to only write valid bytes to the returned slice.
    unsafe fn as_bytes(&mut self) -> &mut [MaybeUninit<u8>];

    /// Update the length of the byte slice.
    ///
    /// # Safety
    ///
    /// The caller must ensure that at least `n` of the bytes returned by
    /// [`Bytes::as_bytes`] are initialised.
    unsafe fn update_length(&mut self, n: usize);
}

It should be used in the following way:

  1. The read implementation calls Byte::as_bytes and writes bytes into the returned slice.
  2. The read implementation calls Bytes::update_length updating the buffers length.
  3. The caller now has a correctly sized buffer.

This my implementation for TcpStream::try_recv for example:

impl TcpStream {
    pub fn try_recv<B>(&mut self, mut buf: B) -> io::Result<usize>
    where
        B: Bytes,
    {
        let dst = buf.as_bytes();
        debug_assert!(
            !dst.is_empty(),
            "called `TcpStream::try_recv with an empty buffer"
        );
        syscall!(recv(
            self.socket.as_raw_fd(),
            dst.as_mut_ptr().cast(),
            dst.len(),
            0, // Flags.
        ))
        .map(|read| {
            let read = read as usize;
            // Safety: just read the bytes.
            unsafe { buf.update_length(read) }
            read
        })
    }
}

Some implementations of Bytes:

For slices, both using [u8] and [MaybeUninit<u8>].

impl Bytes for [MaybeUninit<u8>] {
    fn as_bytes(&mut self) -> &mut [MaybeUninit<u8>] {
        self
    }

    unsafe fn update_length(&mut self, _: usize) {
        // Can't update the length of a slice.
    }
}

impl Bytes for [u8] {
    fn as_bytes(&mut self) -> &mut [MaybeUninit<u8>] {
        // Safety: `MaybeUninit<u8>` is guaranteed to have the same layout as
        // `u8` so it same to cast the pointer.
        unsafe { &mut *(self as *mut [u8] as *mut [MaybeUninit<u8>]) }
    }

    unsafe fn update_length(&mut self, _: usize) {
        // Can't update the length of a slice.
    }
}

This results is basically the same usage as Read::read currently because the user still needs to slice the buffer properly themselves, but does allow the use of MaybeUninit<u8>. However the implement for Vec<u8> is much nicer (see usage below).

impl Bytes for Vec<u8> {
    fn as_bytes(&mut self) -> &mut [MaybeUninit<u8>] {
        // Safety: `Vec` ensures the pointer is correct for us. The pointer is
        // at least valid for start + `Vec::capacity` bytes, a range we stay
        // within.
        unsafe {
            let len = self.len();
            let data_ptr = self.as_mut_ptr().add(len).cast();
            // NOTE: `Vec` ensure capacity >= len, so this is always >= 0.
            let capacity_left = self.capacity() - len;
            slice::from_raw_parts_mut(data_ptr, capacity_left)
        }
    }

    unsafe fn update_length(&mut self, n: usize) {
        // Safety: caller must ensure that `n` is valid.
        self.set_len(self.len() + n);
    }
}

This allow the following to just work as expected.

let mut stream = TcpStream::connect(address)?;
let mut buf = Vec::with_capacity(4 * 1024); // 4 KB.
let n = stream.recv(&mut buf)?;
// `buf` now contains `n` bytes with it length set to `buf.len() + n`.
// Tthe bytes are added to the `Vec`, not overwritten.

WaffleLapkin commented on Oct 15

edited

@Thomasdezeeuw please note that your impl Bytes for [u8] is unsound. While converting Box<[T]> -> Box<[MaybeUninit<T>]> or &[T] -> &[MaybeUninit<T>] is ok, converting &mut [T] to &mut [MaybeUninit<T>] is UB as you may write uninit values to T.

let mut buf = [0u8];
let maybe = unsafe { &mut *((&mut buf) as *mut [u8] as *mut [MaybeUninit<u8>]) };
// After this line `buf` will contain uninitialized value, which is UB
maybe[0] = MaybeUninit::uninit();

See also: rust-lang/rust#66699

But note, @Thomasdezeeuw, that your suggestion can be fixed the same way an actual crate with a similar API to your suggestion did: inventing (and using) an &out [u8] reference abstraction: https://docs.rs/uninit/0.4.0/uninit/out_ref/struct.Out.html

It's basically &mut [u8], but for the ability to write garbage to the pointee, which is what makes &mut [u8] -> &mut [MU<u8>] unsound.

@WaffleLapkin, @danielhenrymantilla good points. I didn't think about that case. I could be somewhat easily fixed by making Bytes::as_bytes unsafe and requiring the caller to only write valid bytes (I've updated my comment #2930 (comment)). Or as @danielhenrymantilla suggested some kind of wrapper type.

unsound unsafe code) is to fail to actually write useful data into the buffer. Code using a `BrokenReader` may see bad

data in the buffer, but the bad data at least has defined contents now!

Note that `read` is still a required method of the `Read` trait. It can be easily written to delegate to `read_buf`:

cramertj on Oct 16

Member

I'm personally imagining that I'll very quickly pull out this macro:

macro_rules! default_read_impl {
    () => {
        fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
            let mut buf = std::io::ReadBuf::new(buf);
            std::io::Read::read_buf(self, &mut buf)?;
            std::result::Result::Ok(buf.filled().len())
        }
    }
}

impl Read for SomeReader {
    default_read_impl!();
    fn read_buf(&mut self, buf: &mut ReadBuf<'_>) -> io::Result<()> {
        ...
    }
}

I'm not sure whether that would be worth including in std. My gut feeling is "no", but that it's going to be a very mild annoyance to copy-paste into every new crate that implements lots of Read types.

maybe Read::read should have a default implementation and also have an annotation making rustc require at least one of Read::read, Read::read_buf, or the vectored read functions must be implemented.

/// # Safety

///

/// The caller must not de-initialize portions of the buffer that have already been initialized.

#[inline]

pub unsafe fn unfilled_mut(&mut self) -> &mut [MaybeUninit<u8>] { ... }

Comment on lines

+306 to +310

Darksonn on Oct 24

I would like to propose that the safety section is expanded to the following:

The caller must not de-initialize portions of the buffer that have already been initialized.
This includes any bytes in the region marked as uninitialized by ReadBuf.

Without a comment like this, it is unclear whether it is sound to put an UninitSlice from the bytes crate into an ReadBuf without first initializing the memory.

danielhenrymantilla on Oct 24

Or simply change that to return an Out<[u8]> to get rid of the unsafe altogether: even though I like the ReadBuf abstraction since it tracks the number of initialized elements (which an Out<[u8]> does not), I still think that an out-references abstraction is both intuitive and less error-prone than &mut MaybeUninit<...>s, and something that is very tied to "writing into potentially-uninit stuff", so the &mut MU part of the ReadBuf API ought to be replaced with that. Being able to, for instance, feature a non-unsafe version of this method is just one example of that (and then, if people were to need exactly an &mut [MU<u8>], they could call the unsafe .as_uninit_mut() method, who has a well-detailed section about how and why writing uninit bytes to the pointee is unsound).

Contributor

nikomatsakis commented on Oct 28

edited

Huzzah! The @rust-lang/libs team decided in September to accept this RFC. I have merged it into the repository. The tracking issue is rust-lang/rust#78485, if you would like to follow along from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK