Below is a collection of tips to help when attempting to process especially gnarly CSV files.
## Override the header flag if the header is not correctly detected

If a file contains only string columns, the header auto-detection might fail. Provide the `header` option to override this behavior.

```sql
SELECT * FROM read_csv_auto('flights.csv', header=True);
```
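The same option works in the other direction: if the first row is data but gets misread as a header, header detection can be disabled explicitly. A minimal sketch, assuming a hypothetical headerless `flights.csv`:

```sql
-- Treat the first row as data rather than as column names
SELECT * FROM read_csv_auto('flights.csv', header=False);
```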
## Provide names if the file does not contain a header

If the file does not contain a header, names will be auto-generated by default. You can provide your own names with the `names` option.

```sql
SELECT * FROM read_csv_auto('flights.csv', names=['FlightDate', 'UniqueCarrier']);
```
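The provided names carry through to downstream statements, so the result can be materialized directly. A short sketch, assuming a hypothetical `flights` table name:

```sql
-- Load the headerless file into a new table, keeping the provided names
CREATE TABLE flights AS
    SELECT * FROM read_csv_auto('flights.csv', names=['FlightDate', 'UniqueCarrier']);
```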
## Override the types of specific columns

The `types` flag can be used to override the types of only certain columns by providing a struct of `name` -> `type` mappings.

```sql
SELECT * FROM read_csv_auto('flights.csv', types={'FlightDate': 'DATE'});
```
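Several columns can be pinned down in a single struct while the remaining columns are still auto-detected. A sketch reusing the hypothetical `flights.csv`:

```sql
-- Fix the types of two columns; all other columns keep their
-- auto-detected types
SELECT * FROM read_csv_auto('flights.csv',
    types={'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR'});
```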
## Use COPY when loading data into a table

The `COPY` statement copies data directly into a table. The CSV reader uses the schema of the table instead of auto-detecting types from the file. This speeds up loading and prevents mistakes from being made during type detection.

```sql
COPY tbl FROM 'test.csv' (AUTO_DETECT 1);
```
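A minimal end-to-end sketch, assuming a hypothetical `test.csv` whose columns match the table definition below:

```sql
-- The table schema fixes the column types; with AUTO_DETECT, the CSV
-- dialect (delimiter, quoting, header) is still detected automatically
CREATE TABLE tbl (FlightDate DATE, UniqueCarrier VARCHAR);
COPY tbl FROM 'test.csv' (AUTO_DETECT 1);
```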
## Use union_by_name when loading files with different schemas

The `union_by_name` option can be used to unify the schema of files that have different or missing columns. For files that do not have certain columns, `NULL` values are filled in.

```sql
SELECT * FROM read_csv_auto('flights*.csv', union_by_name=True);
```
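To check what the unified schema looks like before loading, the query can be wrapped in `DESCRIBE`. A sketch over the same hypothetical `flights*.csv` glob:

```sql
-- Show the combined column list; columns absent from some files still
-- appear, and those files contribute NULLs for them
DESCRIBE SELECT * FROM read_csv_auto('flights*.csv', union_by_name=True);
```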