Scio

A Scala API for Apache Beam and Google Cloud Dataflow.

Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow inspired by Apache Spark and Scalding.

Scio 0.3.0 and future versions depend on Apache Beam (org.apache.beam) while earlier versions depend on Google Cloud Dataflow SDK (com.google.cloud.dataflow). See this page for a list of breaking changes.

Features

  • Scala API close to that of Spark and Scalding core APIs
  • Unified batch and streaming programming model
  • Fully managed service*
  • Integration with Google Cloud products: Cloud Storage, BigQuery, Pub/Sub, Datastore, Bigtable
  • JDBC, TensorFlow TFRecords, Cassandra, Elasticsearch and Parquet I/O
  • Interactive mode with Scio REPL
  • Type safe BigQuery
  • Integration with Algebird and Breeze
  • Pipeline orchestration with Scala Futures
  • Distributed cache

* provided by Google Cloud Dataflow

Quick Start

Download and install the Java Development Kit (JDK) version 8.

Install sbt.

Use our giter8 template to quickly create a new Scio job repository:

sbt new spotify/scio.g8

Switch to the new repo (default scio-job) and build it:

cd scio-job
sbt stage

Run the included word count example:

target/universal/stage/bin/scio-job --output=wc

List result files and inspect content:

ls -l wc
cat wc/part-00000-of-00004.txt
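
For orientation, a word count job of the kind run above might look roughly like the sketch below. It is not the template's actual source: the object name `WordCountSketch` and the helper `tokenize` are hypothetical, and unlike the template (which is invoked above with only `--output`) this sketch assumes both `--input` and `--output` are passed explicitly. It also assumes a Scio version that provides `sc.run()` and has `scio-core` on the classpath.

```scala
import com.spotify.scio._

// Hypothetical job object; the job generated by the giter8 template may differ.
object WordCountSketch {

  // Split a line into words, dropping the empty strings that split() can
  // produce when a line starts with a delimiter.
  def tokenize(line: String): Seq[String] =
    line.split("[^a-zA-Z']+").filter(_.nonEmpty).toSeq

  def main(cmdlineArgs: Array[String]): Unit = {
    // Separate Beam pipeline options from job-specific arguments.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Assumption: both arguments are required in this sketch.
    val input = args("input")
    val output = args("output")

    sc.textFile(input)                                // SCollection[String], one element per line
      .flatMap(tokenize)                              // one element per word
      .countByValue                                   // SCollection[(String, Long)]
      .map { case (word, count) => s"$word: $count" } // format each pair as a text line
      .saveAsTextFile(output)                         // sharded text files under the output path

    // Submit the pipeline and block until it completes.
    sc.run().waitUntilFinish()
  }
}
```

The chain of `flatMap`, `countByValue` and `map` reads much like the equivalent Spark or Scalding code, which is the similarity the feature list refers to.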

Documentation

Getting Started is the best place to start with Scio. If you are new to Apache Beam and distributed data processing, check out the Beam Programming Guide first for a detailed explanation of the Beam programming model and concepts. If you have experience with other Scala data processing libraries, check out this comparison between Scio, Scalding and Spark.

Example Scio pipelines and tests can be found under scio-examples. Many of them are direct ports of Beam's Java examples. See this page for some of them with side-by-side explanations. Also see Big Data Rosetta Code for common data processing code snippets in Scio, Scalding and Spark.

Artifacts

Scio includes the following artifacts:

  • scio-core: core library
  • scio-test: test utilities, add to your project as a "test" dependency
  • scio-avro: add-on for Avro, can also be used standalone
  • scio-google-cloud-platform: add-on for Google Cloud IOs: BigQuery, Bigtable, Pub/Sub, Datastore, Spanner
  • scio-cassandra*: add-ons for Cassandra
  • scio-elasticsearch*: add-ons for Elasticsearch
  • scio-extra: extra utilities for working with collections, Breeze, etc., best effort support
  • scio-jdbc: add-on for JDBC IO
  • scio-neo4j: add-on for Neo4J IO
  • scio-parquet: add-on for Parquet
  • scio-tensorflow: add-on for TensorFlow TFRecords IO and prediction
  • scio-redis: add-on for Redis
  • scio-smb: add-on for Sort Merge Bucket operations
  • scio-repl: extension of the Scala REPL with Scio specific operations
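
As a rough illustration of how these artifacts are pulled into an sbt build, the fragment below adds the core library plus the test utilities as a "test" dependency. The version string is a placeholder, not a real release number; substitute a version published on Maven Central. Other add-ons from the list would be added the same way under the same `com.spotify` group.

```scala
// build.sbt (fragment)

// Placeholder: replace with an actual Scio release version.
val scioVersion = "x.y.z"

libraryDependencies ++= Seq(
  // Core library, needed by every Scio job.
  "com.spotify" %% "scio-core" % scioVersion,
  // Test utilities, only on the test classpath.
  "com.spotify" %% "scio-test" % scioVersion % Test
)
```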

License

Copyright 2021 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
