ARROW-12306: [Rust][datafusion] Read CSV format text from stdin or memory by heymind · Pull Request #10066 · apache/arrow

heymind · 2021-04-16T02:25:22Z

Background from JIRA:

I'm building a command-line tool that can run SQL queries on text files (CSV, JSON Lines, ...). Currently, however, the CsvExec in DataFusion can only read CSV text from files. I have checked its underlying implementation, the CSV reader in arrow, and anything that implements Read could be a valid input.
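To illustrate why accepting anything that implements Read is enough, here is a minimal standard-library sketch. The helper `count_csv_rows` is hypothetical and is not part of arrow or DataFusion; it only shows that one generic function can serve files, stdin, and in-memory bytes alike.

```rust
use std::io::{BufRead, BufReader, Read};

/// Counts data rows in CSV text from any `Read` source.
/// Hypothetical helper for illustration only: it does no quote handling
/// and is not the arrow CSV reader.
pub fn count_csv_rows<R: Read>(source: R, has_header: bool) -> std::io::Result<usize> {
    let mut rows = 0;
    for (i, line) in BufReader::new(source).lines().enumerate() {
        let line = line?;
        // Skip the header line (if any) and blank lines.
        if (i == 0 && has_header) || line.trim().is_empty() {
            continue;
        }
        rows += 1;
    }
    Ok(rows)
}
```

The same function should accept `std::io::stdin()`, a `std::fs::File`, or a byte slice such as `"a,b\n1,2\n".as_bytes()`, which is the flexibility this PR asks for in CsvExec.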

github-actions · 2021-04-16T02:25:46Z

https://issues.apache.org/jira/browse/ARROW-12306

alamb

Thank you for the contribution @heymind -- the code and tests look nice.

I had a few suggestions, the most important of which concerns the behavior of CsvExec when it is cloned. After that, I think this PR is more or less ready to go.

cc @andygrove and @Dandandan

alamb · 2021-04-18T10:13:49Z

+                )?))
+            }
+            SourceReader::Reader(rdr) => {
+                if let Some(rdr) = rdr.lock().unwrap().take() {


I recommend checking here that partition==0 and returning an internal error otherwise:

Something like

if partition != 0 { return Err(DataFusionError::Internal("Only partition 0 is valid when CSV comes from a reader".to_string())); }
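A self-contained sketch of that guard might look like the following. `SketchError` is a stand-in that models only the one variant needed here, and `check_reader_partition` is a hypothetical helper name, not something from the PR.

```rust
/// Stand-in for DataFusion's error type; only the variant used here is modeled.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Internal(String),
}

/// Rejects any partition other than 0, since a single reader can back
/// exactly one partition. Hypothetical helper for illustration.
pub fn check_reader_partition(partition: usize) -> Result<(), SketchError> {
    if partition != 0 {
        return Err(SketchError::Internal(
            "Only partition 0 is valid when CSV comes from a reader".to_string(),
        ));
    }
    Ok(())
}
```

Returning early keeps the reader untouched when the caller asks for a partition that cannot exist.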

alamb · 2021-04-18T10:14:27Z

+                    )?))
+                } else {
+                    Err(DataFusionError::Execution(
+                        "You can only read once if the data comes from a reader"


Suggested change

-                        "You can only read once if the data comes from a reader"
+                        "Error reading CSV: Data can only be read a single time when the source is a reader"
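The read-once behavior under discussion can be sketched with the standard library alone. `OnceReader` is a hypothetical name, the error is a plain `String` rather than DataFusionError, and the `Send` bound is an assumption made so the type can be shared across threads.

```rust
use std::io::Read;
use std::sync::Mutex;

/// Holds a reader that can be handed out exactly once.
/// Hypothetical type for illustration; not the PR's actual definition.
pub struct OnceReader {
    inner: Mutex<Option<Box<dyn Read + Send>>>,
}

impl OnceReader {
    pub fn new(reader: Box<dyn Read + Send>) -> Self {
        Self { inner: Mutex::new(Some(reader)) }
    }

    /// Returns the reader on the first call; every later call gets an error.
    pub fn take(&self) -> Result<Box<dyn Read + Send>, String> {
        self.inner.lock().unwrap().take().ok_or_else(|| {
            "Error reading CSV: Data can only be read a single time when the source is a reader"
                .to_string()
        })
    }
}
```

`Option::take` leaves `None` behind, which is what makes the second call fail instead of silently returning an exhausted reader.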

alamb · 2021-04-18T10:18:33Z

+                    filenames: filenames.clone(),
+                }
+            }
+            SourceReader::Reader(_) => Self::Reader(Mutex::new(None)),


This might cause some non-trivial confusion -- namely that Clone'ing a SourceReader will not clone the underlying reader. Thus any Clone'd CsvExec won't be usable at all (it will generate an error).

I wonder if CsvExec really needs to be Clone at all -- like can we just remove the Clone derivation:

#[derive(Debug)] pub struct CsvExec {

I agree it's unnecessary for CsvExec to be Clone. But if we remove the Clone derivation, will it introduce a breaking change?

If the CsvExec is built from source files (not from a reader), Clone will act as expected.

Removing Clone would be a breaking change, I agree -- though I am not sure how many people Clone physical plans.

But CsvExec::with_new_children itself requires Clone ...
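The pitfall in this thread can be reproduced in a few lines. The enum below is a simplified sketch modeled on the diff fragment above; the names `Source`, `from_reader`, and `is_readable` are hypothetical, and the `Send` bound is an assumption.

```rust
use std::io::Read;
use std::sync::Mutex;

/// Simplified sketch of a CSV source; not the PR's actual definition.
pub enum Source {
    Files(Vec<String>),
    Reader(Mutex<Option<Box<dyn Read + Send>>>),
}

impl Clone for Source {
    fn clone(&self) -> Self {
        match self {
            Source::Files(names) => Source::Files(names.clone()),
            // A boxed reader cannot be duplicated, so the clone starts out empty.
            Source::Reader(_) => Source::Reader(Mutex::new(None)),
        }
    }
}

impl Source {
    pub fn from_reader(reader: Box<dyn Read + Send>) -> Self {
        Source::Reader(Mutex::new(Some(reader)))
    }

    /// True if a later read could still succeed.
    pub fn is_readable(&self) -> bool {
        match self {
            Source::Files(_) => true,
            Source::Reader(slot) => slot.lock().unwrap().is_some(),
        }
    }
}
```

With this shape, a clone of a file-backed source stays usable, while a clone of a reader-backed source reports `is_readable() == false`, which is the confusing behavior the review points out.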

alamb · 2021-04-18T10:19:38Z

+            Some(s) => s.clone(),
+            None => {
+                return Err(DataFusionError::Execution(
+                    "Schema must be provided".to_string(),


Suggested change

-                    "Schema must be provided".to_string(),
+                    "Schema must be provided to CsvRead".to_string(),

alamb · 2021-04-18T10:22:07Z

    }
 }
+/// Loads CSV data from a reader
+pub struct CsvRead<R: Read> {


I wonder if there is any way to reduce the duplication between CsvRead<R> and CsvFile -- as you have done for CsvExec. That way we can reuse the same tests for things like schema matching.

Since the Reader gets Boxed anyways for execution, I don't think there is any performance difference using something that is generic over R vs a Box<R>

Okay. Would you mind introducing another dependency, the either crate? It's clearer than introducing a new enum.

pub struct CsvFile { source: Either<Box<dyn Read>,String>, .... }

I think we are trying to keep the number of dependencies down (as we expect applications to be built on top of DataFusion) so I think we should use a specific enum rather than another crate
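A dedicated enum along those lines could be sketched as follows. `CsvSource` and `into_reader` are hypothetical names chosen for illustration, and the `Send` bound is an assumption; the PR's final shape may differ.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Hypothetical source type: either a path to open lazily or an
/// already-open reader. Avoids pulling in an extra crate.
pub enum CsvSource {
    Path(String),
    Reader(Box<dyn Read + Send>),
}

impl CsvSource {
    /// Turns either variant into a boxed reader so downstream code
    /// only has to handle one case.
    pub fn into_reader(self) -> io::Result<Box<dyn Read + Send>> {
        match self {
            CsvSource::Path(path) => Ok(Box::new(File::open(path)?)),
            CsvSource::Reader(reader) => Ok(reader),
        }
    }
}
```

Named variants also document intent at call sites (`CsvSource::Path(..)` versus `Left(..)`), which is a small readability gain over a generic two-way sum type.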

alamb · 2021-04-19T10:21:13Z

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository.

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

heymind · 2021-04-25T01:51:35Z

apache/datafusion#54

github-actions Bot added Component: Rust - DataFusion Component: Rust labels Apr 16, 2021

[rust][datafusion] load csv data from a reader

2e1e242

alamb approved these changes Apr 18, 2021

This was referenced Apr 25, 2021

Read CSV format text from stdin or memory apache/datafusion#53

Closed

Read CSV format text from stdin or memory apache/datafusion#54

Merged

heymind closed this Apr 25, 2021