ARROW-12306: [Rust][datafusion] Read CSV format text from stdin or memory#10066
ARROW-12306: [Rust][datafusion] Read CSV format text from stdin or memory#10066heymind wants to merge 1 commit intoapache:masterfrom heymind:ysw/csv_from_reader
Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you for the contribution @heymind -- the code and tests look nice.
I had a few suggestions, the most important of which is the behavior of CsvExec when Clone. After that I think this PR is more or less ready to go
cc @andygrove and @Dandandan
| )?)) | ||
| } | ||
| SourceReader::Reader(rdr) => { | ||
| if let Some(rdr) = rdr.lock().unwrap().take() { |
There was a problem hiding this comment.
I recommend checking here that partition==0 and returning an internal error otherwise:
Something like
if partition != 0 {
Err(DataFusionError::Internal("Only partition 0 is valid when CSV comes from a reader"))
}
.
| )?)) | ||
| } else { | ||
| Err(DataFusionError::Execution( | ||
| "You can only read once if the data comes from a reader" |
There was a problem hiding this comment.
| "You can only read once if the data comes from a reader" | |
| "Error reading CSV: Data can only be read a single time when the source is a reader" |
| filenames: filenames.clone(), | ||
| } | ||
| } | ||
| SourceReader::Reader(_) => Self::Reader(Mutex::new(None)), |
There was a problem hiding this comment.
This might cause some non trivial confusion -- namey that Clone'ing a SourceReader will not clone the underlying reader. Thus any Clone'd CsvExec won't be usable at at all (it will generate an error)
I wonder if CsvExec really needs to be Clone at all -- like can we just remove the Clone derivation:
#[derive(Debug)]
pub struct CsvExec {
There was a problem hiding this comment.
I agree it's unnecessary for CsvExec to be Clone. But if we remove the Clone derivation, will it introduce a breaking change ?
If the CsvExec is built from source files ( not from a reader ) , the Clone will act as expected.
There was a problem hiding this comment.
Removing Clone would be a breaking change I agree -- though I am not sure how many people Clone physical plans
There was a problem hiding this comment.
But CsvExec::with_new_children requires itself Clone ...
| Some(s) => s.clone(), | ||
| None => { | ||
| return Err(DataFusionError::Execution( | ||
| "Schema must be provided".to_string(), |
There was a problem hiding this comment.
| "Schema must be provided".to_string(), | |
| "Schema must be provided to CsvRead".to_string(), |
| } | ||
| } | ||
| /// Loads CSV data from a reader | ||
| pub struct CsvRead<R: Read> { |
There was a problem hiding this comment.
I wonder if there is any way to reduce the duplication between CsvRead<R> and CsvFile -- as you have done for CsvExec. That way we can reuse the same tests for things like schema matching.
Since the Reader gets Boxed anyways for execution, I don't think there is any performance difference using something that is generic over R vs a Box<R>
There was a problem hiding this comment.
Okay. Would you mind to introduce another dependency crate either ? It's more clear than introducing a new enum.
pub struct CsvFile {
source: Either<Box<dyn Read>,String>,
....
}There was a problem hiding this comment.
I think we are trying to keep the number of dependencies down (as we expect applications to be built on top of DataFusion) so I think we should use a specific enum rather than another crate
|
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository Please see the mailing-list thread for more details We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs. |
1 similar comment
|
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository Please see the mailing-list thread for more details We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs. |
Background from JIRA: