SQL-like query language for csv

Select Query

Select query is used to retrieve data from csv files.

select_query
  : [with_clause]
      select_entity
      [order_by_clause]
      [limit_clause]
      [FOR UPDATE]

select_entity
  : select_clause
      [from_clause]
      [where_clause]
      [group_by_clause]
      [having_clause]
  | select_set_entity set_operator [ALL] select_set_entity 

select_set_entity
  : select_entity
  | (select_query)
with_clause
With Clause
select_clause
Select Clause
from_clause
From Clause
where_clause
Where Clause
group_by_clause
Group By Clause
having_clause
Having Clause
order_by_clause
Order By Clause
limit_clause
Limit Clause
set_operator
Set Operators

With Clause

WITH common_table_expression [, common_table_expression ...]
common_table_expression
Common Table Expression

Select Clause

SELECT [DISTINCT] field [, field ...]

Distinct

You can use DISTINCT keyword to retrieve only unique records.

field syntax

field
  : value
  | value AS alias
  | *
  | table_name.*
value
value
alias
identifier
table_name
identifier
*
Asterisk(U+002A *) denotes all columns.

When used alone, the asterisk selects all columns in all tables; when used with a table name, it selects all columns in that table.

From Clause

FROM table [, {table|LATERAL laterable_table} ...]

If multiple tables have been enumerated, tables are joined using cross join.

table syntax

table
  : table_entity
  | table_entity alias 
  | table_entity AS alias
  | join
  | DUAL
  | laterable_table
  | (table)

table_entity
  : table_identifier
  | format_specified_function
  | inline_format_specified_function

table_identifier
  : table_name
  | url
  | table_identification_function
  | STDIN
  
laterable_table
  : subquery
  | subquery alias
  | subquery AS alias

subquery
  : (select_query)

join
  : table CROSS JOIN table
  | table [INNER] JOIN table join_condition
  | table {LEFT|RIGHT|FULL} [OUTER] JOIN table join_condition
  | table NATURAL [INNER] JOIN table
  | table NATURAL {LEFT|RIGHT|FULL} [OUTER] JOIN table
  | table CROSS JOIN LATERAL laterable_table
  | table [INNER] JOIN LATERAL laterable_table join_condition
  | table LEFT [OUTER] JOIN LATERAL laterable_table join_condition
  | table NATURAL [INNER] JOIN LATERAL laterable_table
  | table NATURAL LEFT [OUTER] JOIN LATERAL laterable_table

join_condition
  : ON condition
  | USING (column_name [, column_name, ...])

table_identification_function
  : FILE::(file_path)
  : INLINE::(file_path)
  : URL::(url_string)
  : DATA::(data_string)

format_specified_function
  : CSV(delimiter, table_identifier [, encoding [, no_header [, without_null]]])
  | FIXED(delimiter_positions, table_identifier [, encoding [, no_header [, without_null]]])
  | JSON(json_query, table_identifier)
  | JSONL(json_query, table_identifier)
  | LTSV(table_identifier [, encoding [, without_null]])

inline_format_specified_function  -- Deprecated. Table identification functions can be used instead.
  : CSV_INLINE(delimiter, inline_table_identifier [, encoding [, no_header [, without_null]]])
  | CSV_INLINE(delimiter, csv_data)
  | JSON_INLINE(json_query, inline_table_identifier [, encoding [, no_header [, without_null]]])
  | JSON_INLINE(json_query, json_data)

inline_table_identifier
  : table_name
  | url_identifier

table_name
identifier

A table_name represents a file path, a temporary table, or a inline table. You can use absolute path or relative path from the directory specified by the ”–repository” option as a file path.

When the file name extension is “.csv”, “.tsv”, “.json”, “.jsonl” or “.txt”, the format to be loaded is automatically determined by the file extension, and you can omit it.

FROM `user.csv`          -- Relative path
FROM `/path/to/user.csv` -- Absolute path
FROM user                -- Relative path without file extension

The specifications of the command options are used as file attributes such as encoding to be loaded. If you want to specify the different attributes for each file, you can use table_object expressions for each file to load.

Once a file is loaded, then the data is cached, and it can be loaded with only file name after that within the transaction.

url
A string of characters representing URL starting with a schema name and a colon.

“http”, “https” and “file” schemes are available.

https://example.com/files/data.csv       -- Remote resource downloaded using HTTP GET method
file:///C:/Users/yourname/files/data.csv -- Local file specified by absolute path
file:./data.csv                          -- Local file specified by relative path

An inline table is created from remote resources. The downloaded data is cached until the transaction ends.

The file format is automatically determined when the http response specifies the following content types.

MIME type Format
text/csv CSV
application/json JSON
table_identification_function
Function notation with a name followed by two colons.
  • FILE::(file_path)

    file_path: string

    This is the same as specifying a file using table_name.

  • INLINE::(file_path)

    file_path: string

    Files read by this function are not cached and cannot be updated.

  • URL::(url_string)

    url_string: string

    When specifying a resource using url, the path must be encoded, but this function does not require encoding.

  • DATA::(data_string)

    file_path: string

    This function creates an inline table from a string.

Example of use in a query:

SELECT id,
       tag_name,
       (SELECT COUNT(*) FROM JSON('', DATA::(assets))) AS number_of_assets,
       published_at
  FROM https://api.github.com/repos/mithrandie/csvq/releases
 WHERE prerelease = false
 ORDER BY published_at DESC
 LIMIT 10
alias
identifier

If alias is not specified, table_name stripped its directory path and extension is used as alias.

-- Following expressions are equivalent
FROM `/path/to/user.csv`
FROM `/path/to/user.csv` AS user
select_query
Select Query
condition
value
column_name
identifier
delimiter
string
json_query
JSON Query

Empty string is equivalent to “{}”.

delimiter_positions
string

“SPACES” or JSON Array of integers

encoding
string or identifier

“AUTO”, “UTF8”, “UTF8M”, “UTF16”, “UTF16BE”, “UTF16LE”, “UTF16BEM”, “UTF16LEM” or “SJIS”.

no_header
boolean
without_null
boolean
url_identifier
identifier

A URL of the http or https scheme to refer to a resource.

csv_data
string
json_data
string

Special Tables

DUAL
The dual table has one column and one record, and the only field is empty. This table is used to retrieve pseudo columns.
STDIN
The stdin table loads data from pipe or redirection as a csv data. The stdin table is one of temporary tables that is declared automatically. This table cannot to be used in the interactive shell.

Where Clause

The Where clause is used to filter records.

WHERE condition
condition
value

Group By Clause

The Group By clause is used to group records.

GROUP BY field [, field ...] 
field
value

Having Clause

The Having clause is used to filter grouped records.

HAVING condition
condition
value

Order By Clause

The Order By clause is used to sort records.

ORDER BY order_item [, order_item ...]

order item

order_item
  : field [order_direction] [null_position]
  
order_direction
  : {ASC|DESC}
  
null_position
  : NULLS {FIRST|LAST}
field
value

If DISTINCT keyword is specified in the select clause, you can use only enumerated fields in the select clause as field.

order_direction
ASC sorts records in ascending order. DESC sorts in descending order. ASC is the default.
null_position
FIRST puts null values first. LAST puts null values last. If order_direction is specified as ASC then FIRST is the default, otherwise LAST is the default.

Limit Clause

The Limit clause is used to specify the maximum number of records to return and exclude the first set of records.

limit_clause
  : LIMIT number_of_records [{ROW|ROWS}] [{ONLY|WITH TIES}] [offset_clause]
  | LIMIT percentage PERCENT [{ONLY|WITH TIES}] [offset_clause]
  | [offset_clause] FETCH {FIRST|NEXT} number_of_records {ROW|ROWS} [{ONLY|WITH TIES}]
  | [offset_clause] FETCH {FIRST|NEXT} percentage PERCENT [{ONLY|WITH TIES}]
  | offset_clause

offset_clause
  : OFFSET number_of_records [{ROW|ROWS}]
number_of_records
integer
percent
float

ROW and ROWS after number_of_records, FIRST and NEXT after FETCH, and ONLY keyword does not affect the result.

If PERCENT keyword is specified, maximum number of records is percentage percent of the result set that includes the excluded records by offset_clause.

If WITH TIES keywords are specified, all records that have the same sort keys specified by Order By Clause as the last record of the limited records are included in the records to return. If there is no Order By Clause in the query, WITH TIES keywords are ignored.