Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Superset | 51,144 | 2 | a day ago | 3 | April 29, 2022 | 1,312 | apache-2.0 | TypeScript | ||
Apache Superset is a Data Visualization and Data Exploration Platform | ||||||||||
Cloudquery | 4,244 | 6 | 5 hours ago | 241 | August 14, 2022 | 181 | mpl-2.0 | Go | ||
The open source high performance data integration platform built for developers. | ||||||||||
Mage Ai | 3,691 | 19 hours ago | 9 | June 27, 2022 | 54 | apache-2.0 | Python | |||
🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data. | ||||||||||
Data Diff | 2,231 | a day ago | 16 | July 06, 2022 | 79 | mit | Python | |||
Compare tables within or across databases | ||||||||||
Data Science Roadmap | 1,076 | 15 days ago | 1 | mit | ||||||
Data Science Roadmap from A to Z | ||||||||||
Awesome Opensource Data Engineering | 950 | 18 days ago | 3 | other | ||||||
An Awesome List of Open-Source Data Engineering Projects | ||||||||||
Quadratic | 800 | 2 days ago | 82 | mit | TypeScript | |||||
Quadratic | Technical Spreadsheet | ||||||||||
Dataform | 628 | 2 | 6 | a month ago | 194 | July 12, 2022 | 57 | mit | TypeScript | |
Dataform is a framework for managing SQL based data operations in BigQuery, Snowflake, and Redshift | ||||||||||
Data Engineering Wiki | 518 | 5 days ago | cc0-1.0 | CSS | ||||||
The best place to learn data engineering. Built and maintained by the data engineering community. | ||||||||||
Blaze | 354 | 20 days ago | 24 | apache-2.0 | Rust | |||||
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core. |
An end-to-end data lineage tool, detects table dependencies from SQL statements.
Data Source | Remarks |
---|---|
Local file system | Python Pathlib module |
Amazon S3 | Available for Amazon Managed Workflows for Apache Airflow (MWAA) |
Google Cloud Storage | Available for Google Cloud Composer |
dbt - Google BigQuery | Using dbt compile command internally |
Redash |
This package is distributed on PyPI.
# The base package is for local file system only.
$ pip install stairlight
# Set extras when detecting from other data sources.
# e.g. Amazon S3 and Google Cloud Storage
$ pip install "stairlight[s3, gcs]"
Data Source | TemplateSourceType | Extra |
---|---|---|
Local file system | File | - |
Amazon S3 | S3 | s3 |
Google Cloud Storage | GCS | gcs |
dbt - Google Bigquery | dbt | dbt-bigquery |
Redash | Redash | redash |
There are 3 steps to use.
# 1: Initialize and set your data source settings
$ stairlight init
# 2: Map your SQL statements and tables
$ stairlight map
# 3: Get table dependencies
$ stairlight
Stairlight outputs table dependencies as JSON format.
Top-level keys are table names, and values represents tables that are the data source for each key's table.
{
"test_project.beam_streaming.taxirides_aggregation": {
"test_project.beam_streaming.taxirides_realtime": {
"TemplateSourceType": "File",
"Key": "tests/sql/main/union_same_table.sql",
"Uri": "/foo/bar/stairlight/tests/sql/main/union_same_table.sql",
"Lines": [
{
"LineNumber": 6,
"LineString": " test_project.beam_streaming.taxirides_realtime"
},
{
"LineNumber": 15,
"LineString": " test_project.beam_streaming.taxirides_realtime"
}
]
}
},
"PROJECT_a.DATASET_b.TABLE_c": {
"PROJECT_A.DATASET_A.TABLE_A": {
"TemplateSourceType": "GCS",
"Key": "sql/one_line/one_line.sql",
"Uri": "gs://stairlight/sql/one_line/one_line.sql",
"Lines": [
{
"LineNumber": 1,
"LineString": "SELECT * FROM PROJECT_A.DATASET_A.TABLE_A WHERE 1 = 1"
}
],
"BucketName": "stairlight",
"Labels": {
"Source": null,
"Test": "a"
}
}
},
"AggregateSales": {
"PROJECT_e.DATASET_e.TABLE_e": {
"TemplateSourceType": "Redash",
"Key": 5,
"Uri": "AggregateSales",
"Lines": [
{
"LineNumber": 1,
"LineString": "SELECT service, SUM(total_amount) FROM PROJECT_e.DATASET_e.TABLE_e GROUP BY service"
}
],
"DataSourceName": "BigQuery",
"Labels": {
"Category": "Sales"
}
}
},
"dummy.dummy.example_b": {
"PROJECT_t.DATASET_t.TABLE_t": {
"TemplateSourceType": "dbt",
"Key": "tests/dbt/project_01/target/compiled/project_01/models/b/example_b.sql",
"Uri": "/foo/bar/stairlight/tests/dbt/project_01/target/compiled/project_01/models/b/example_b.sql",
"Lines": [
{
"LineNumber": 1,
"LineString": "select * from PROJECT_t.DATASET_t.TABLE_t where value_a = 0 and value_b = 0"
}
]
}
},
"PROJECT_as.DATASET_bs.TABLE_cs": {
"PROJECT_A.DATASET_A.TABLE_A": {
"TemplateSourceType": "S3",
"Key": "sql/one_line/one_line.sql",
"Uri": "s3://stairlight/sql/one_line/one_line.sql",
"Lines": [
{
"LineNumber": 1,
"LineString": "SELECT * FROM PROJECT_A.DATASET_A.TABLE_A WHERE 1 = 1"
}
],
"BucketName": "stairlight",
"Labels": {
"Source": null,
"Test": "a"
}
}
}
}
Examples can be found here, used for unit testing in CI.
'stairlight.yaml' is for setting up Stairlight itself. It is responsible for specifying SQL statements to be read.
stairlight init
creates a template of stairlight.yaml.
Include:
- TemplateSourceType: File
FileSystemPath: ./tests/sql
Regex: .*/*\.sql$
DefaultTablePrefix: "PROJECT_A"
- TemplateSourceType: GCS
ProjectId: null
BucketName: stairlight
Regex: ^sql/.*/*\.sql$
DefaultTablePrefix: "PROJECT_A"
- TemplateSourceType: Redash
DatabaseUrlEnvironmentVariable: REDASH_DATABASE_URL
DataSourceName: BigQuery
QueryIds:
- 1
- 3
- 5
- TemplateSourceType: dbt
ProjectDir: tests/dbt/project_01
ProfilesDir: tests/dbt
Vars:
key_a: value_a
key_b: value_b
- TemplateSourceType: S3
BucketName: stairlight
Regex: ^sql/.*/*\.sql$
DefaultTablePrefix: "PROJECT_A"
Exclude:
- TemplateSourceType: File
Regex: main/exclude\.sql$
Settings:
MappingFilesRegex:
- .*/mapping\_file\.yaml$
- .*/mapping\_gcs\.yaml$
- .*/mapping\_dbt\.yaml$
- .*/mapping\_s3\.yaml$
# Deprecated from v0.7.2
MappingPrefix: "mapping"
'mapping.yaml' is used to define relationships between input SELECT statements and tables.
stairlight map
creates a template of mapping.yaml and attempts to read from data sources specified in stairlight.yaml.
If successfully read, it outputs settings that have not yet configured in an existing 'mapping.yaml' file.
Global:
Parameters:
DESTINATION_PROJECT: stairlight
params:
PROJECT: 1234567890
DATASET: public
TABLE: taxirides
Mapping:
- TemplateSourceType: File
FileSuffix: "tests/sql/main/union_same_table.sql"
Tables:
- TableName: "test_project.beam_streaming.taxirides_aggregation"
Parameters:
params:
source_table: source
destination_table: destination
IgnoreParameters:
- execution_date.add(days=1).isoformat()
- TemplateSourceType: GCS
Uri: "gs://stairlight/sql/one_line/one_line.sql"
Tables:
- TableName: "PROJECT_a.DATASET_b.TABLE_c"
- TemplateSourceType: Redash
QueryId: 5
DataSourceName: metadata
Tables:
- TableName: New Query
Parameters:
table: dashboards
Labels:
Category: Redash test
- TemplateSourceType: dbt
ProjectName: project_01
FileSuffix: tests/dbt/project_01/target/compiled/project_01/models/example/my_first_dbt_model.sql
Tables:
- TableName: dummy.dummy.my_first_dbt_model
- TemplateSourceType: S3
Uri: "s3://stairlight/sql/one_line/one_line.sql"
Tables:
- TableName: "PROJECT_as.DATASET_bs.TABLE_cs"
ExtraLabels:
- TableName: "PROJECT_A.DATASET_A.TABLE_A"
Labels:
Source: Null
Test: a
This section is for global configurations.
Parameters
is used to set common parameters. If conflicts has occurred with Parameters
in mapping section, mapping section's parameters will be used in preference to global.
Mapping section is used to define relationships between input SELECT statements and tables that created as a result of query execution.
Parameters
allows you to reflect settings in jinja template variables embedded in statements. If multiple settings are applied to a statement using jinja template, the statement will be read as if there were the same number of queries as the number of settings.
In contrast, IgnoreParameters
handles a list to ignore when rendering queries.
This section sets labels to tables that appears only in queries.
$ stairlight --help
usage: stairlight [-h] [-c CONFIG] [--save SAVE] [--load LOAD] {init,check,up,down} ...
An end-to-end data lineage tool, detects table dependencies by SQL SELECT statements.
Without positional arguments, return a table dependency map as JSON format.
positional arguments:
{init,map,check,list,up,down}
init create a new Stairlight configuration file
map (check) create a new configuration file about undefined mappings
list return all ( tables | URIs )
up return upstairs ( tables | URIs )
down return downstairs ( tables | URIs )
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
set a Stairlight configuration directory
-q, --quiet keep silence
--save SAVE A file path where map results will be saved.
You can choose from local file system, GCS, S3.
--load LOAD A file path where map results are saved.
You can choose from local file system, GCS, S3.
It can be specified multiple times.
stairlight init
creates a new Stairlight configuration file.
$ stairlight init --help
usage: stairlight init [-h] [-c CONFIG]
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
set a Stairlight configuration directory
-q, --quiet keep silence
stairlight map
creates a new configuration file about undefined settings. stairlight check
is an alias.
Options are the same as stairlight init
.
stairlight list
outputs all of tables or SQL URIs.
-o
, --output
) determines the output type, tables or URIs.stairlight up
outputs tables or SQL URIs located upstream(upstairs) from the specified table.
-t
, --table
) or label(-l
, --label
) option to specify tables to search.-o
, --output
) is same as stairlight list
.-r
, --recursive
) is set, Stairlight will find dependencies recursively and output as a list.-v
, --verbose
) is set, Stairlight will add detailed information and output it as a dict.$ stairlight up --help
usage: stairlight up [-h] [-c CONFIG] [--save SAVE] [--load LOAD] (-t TABLE | -l LABEL) [-o {table,uri}]
[-v] [-r]
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
set a Stairlight configuration directory
-q, --quiet keep silence
--save SAVE A file path where mapped results will be saved.
You can choose from local file system, GCS, S3.
--load LOAD A file path where mapped results are saved.
You can choose from local file system, GCS, S3.
It can be specified multiple times.
-t TABLE, --table TABLE
table names that Stairlight searches for, can be specified
multiple times. e.g. -t PROJECT_a.DATASET_b.TABLE_c -t
PROJECT_d.DATASET_e.TABLE_f
-l LABEL, --label LABEL
labels set for the table in mapping configuration, can be specified multiple times.
The separator between key and value should be a colon(:).
e.g. -l key_1:value_1 -l key_2:value_2
-o {table,uri}, --output {table,uri}
output type
-v, --verbose return verbose results
-r, --recursive search recursively
stairlight down
outputs tables or SQL URIs located downstream(downstairs) from the specified table.
Options are the same as stairlight up
.
Stairlight can also be used as a library.
tosh2230/stairlight-app is a sample web application rendering table dependency graph with Stairlight, using Graphviz, Streamlit and Google Cloud Run.