Skip to content

ARROW-12306: [Rust][datafusion] Read CSV format text from stdin or memory#10066

Closed
heymind wants to merge 1 commit intoapache:masterfrom
heymind:ysw/csv_from_reader
Closed

ARROW-12306: [Rust][datafusion] Read CSV format text from stdin or memory#10066
heymind wants to merge 1 commit intoapache:masterfrom
heymind:ysw/csv_from_reader

Conversation

@heymind
Copy link
Copy Markdown

@heymind heymind commented Apr 16, 2021

Background from JIRA:

I'm building a command line tool that can run SQL queries on text files (csv, json-line ..) . But the CsvExec in datafusion can only read csv text from files currently. I have checked its inner implantation the csv reader in arrow, anything impl Read could be a valid input.

@github-actions
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @heymind -- the code and tests look nice.

I had a few suggestions, the most important of which is the behavior of CsvExec when Clone. After that I think this PR is more or less ready to go

cc @andygrove and @Dandandan

)?))
}
SourceReader::Reader(rdr) => {
if let Some(rdr) = rdr.lock().unwrap().take() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend checking here that partition==0 and returning an internal error otherwise:

Something like

if partition != 0 { 
   Err(DataFusionError::Internal("Only partition 0 is valid when CSV comes from a reader"))
}
                            .

)?))
} else {
Err(DataFusionError::Execution(
"You can only read once if the data comes from a reader"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"You can only read once if the data comes from a reader"
"Error reading CSV: Data can only be read a single time when the source is a reader"

filenames: filenames.clone(),
}
}
SourceReader::Reader(_) => Self::Reader(Mutex::new(None)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might cause some non trivial confusion -- namey that Clone'ing a SourceReader will not clone the underlying reader. Thus any Clone'd CsvExec won't be usable at at all (it will generate an error)

I wonder if CsvExec really needs to be Clone at all -- like can we just remove the Clone derivation:

#[derive(Debug)]
pub struct CsvExec {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's unnecessary for CsvExec to be Clone. But if we remove the Clone derivation, will it introduce a breaking change ?

If the CsvExec is built from source files ( not from a reader ) , the Clone will act as expected.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing Clone would be a breaking change I agree -- though I am not sure how many people Clone physical plans

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But CsvExec::with_new_children requires itself Clone ...

Some(s) => s.clone(),
None => {
return Err(DataFusionError::Execution(
"Schema must be provided".to_string(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Schema must be provided".to_string(),
"Schema must be provided to CsvRead".to_string(),

}
}
/// Loads CSV data from a reader
pub struct CsvRead<R: Read> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any way to reduce the duplication between CsvRead<R> and CsvFile -- as you have done for CsvExec. That way we can reuse the same tests for things like schema matching.

Since the Reader gets Boxed anyways for execution, I don't think there is any performance difference using something that is generic over R vs a Box<R>

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Would you mind to introduce another dependency crate either ? It's more clear than introducing a new enum.

pub struct CsvFile {
    source: Either<Box<dyn Read>,String>,
   ....
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are trying to keep the number of dependencies down (as we expect applications to be built on top of DataFusion) so I think we should use a specific enum rather than another crate

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

1 similar comment
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@heymind
Copy link
Copy Markdown
Author

heymind commented Apr 25, 2021

apache/datafusion#54

@heymind heymind closed this Apr 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants