StorageTapper - a scalable, real-time MySQL change data streaming, logical backup, and logical replication service.
Hdfs - API and command line interface for HDFS.
Smart Open - utils for streaming large files (S3, HDFS, gzip, bz2...).
Dcos Commons - the DC/OS SDK is a collection of tools, libraries, and documentation for easy integration of technologies such as Kafka, Cassandra, HDFS, Spark, and TensorFlow with DC/OS.
SeaweedFS - a fast distributed storage system for blobs, objects, files, and data lakes, scaling to billions of files. The blob store offers O(1) disk seeks and cloud tiering; the filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mounts, the S3 API, an S3 gateway, Hadoop, WebDAV, encryption, and erasure coding.
Wradlib - a Python package for weather radar data processing.
Hsuntzu - HDFS compression/decompression and archiving (tar, zip, snappy, gzip) codec utilities for Hadoop and Spark.
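The point of Smart Open is that one `open()` call streams local or remote files, with transparent (de)compression inferred from the extension. A minimal sketch, assuming the `smart_open` package is installed; the gzip file is created locally just for the demo, but an `s3://` or `hdfs://` URI would use the same call:

```python
# Sketch of smart_open usage: the same open() streams local files,
# S3/HDFS URIs, and compressed files transparently.
import gzip
import os
import tempfile

from smart_open import open as sopen

# Create a local gzip file to stand in for a large remote object.
path = os.path.join(tempfile.mkdtemp(), "lines.txt.gz")
with gzip.open(path, "wt") as f:
    f.write("first line\nsecond line\n")

# smart_open infers gzip compression from the .gz extension and
# streams the decompressed text line by line.
with sopen(path) as f:
    lines = [line.strip() for line in f]

print(lines)
```

For a remote object the path would simply change, e.g. `sopen("s3://bucket/key.txt.gz")`, with credentials picked up from the environment.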
ElasticCTR - an enterprise-grade, Kubernetes-based open-source recommender-system solution built on PaddlePaddle (飞桨). It combines high-accuracy CTR models honed in Baidu's business scenarios, the large-scale distributed training capability of the open-source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters, letting users deploy a recommender system on Kubernetes with one click. It features high performance, industrial-grade deployment, and an end-to-end experience, and as an open-source suite it supports deep secondary development.
Dynamometer - a tool for scale and performance testing of HDFS, with a specific focus on the NameNode.
Hdfs Shell - an HDFS manipulation tool that works with the functions integrated in Hadoop DFS.
Ibis - a pandas-like deferred expression system with first-class SQL support.
Bigdata File Viewer - a cross-platform (Windows, macOS, Linux) desktop application to view common big data binary formats such as Parquet, ORC, and AVRO. Supports the local file system, HDFS, AWS S3, Azure Blob Storage, etc.
Camus - mirror of LinkedIn's Camus, a Kafka-to-HDFS pipeline.
Tiledb Py - the Python interface to the TileDB storage manager.
Rumble - ⛈️ Rumble 1.11.0 "Banyan Tree" 🌳 for Apache Spark. Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...). No install required (just a jar to download). Declarative machine learning and more.
Tiledb - the universal storage engine.
Hdfs - a native Go client for HDFS.
Pucket - a bucketing and partitioning system for Parquet.
Bigdata Interview - 🎯🌟 A collection of big data interview questions gathered from around the web, together with the author's own answer summaries. Currently covers the Hadoop, Hive, Spark, Flink, HBase, Kafka, and ZooKeeper frameworks.
Cluster Pack - a library on top of either pex or conda-pack to make your Python code easily available on a cluster.
Hadoop For Geoevent - an ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Sparta - real-time analytics and data pipelines based on Spark Streaming.
Devops Python Tools - 80+ DevOps and data CLI tools: AWS, GCP, GCF Python Cloud Functions, log anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark data converters and validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr, etc.
JuiceFS - a distributed POSIX file system built on top of Redis and S3.
bigkube - Minikube for big data with Scala and Spark.
bigdata-fun - a complete (distributed) big data stack, running in containers.
leaflet heatmap - a simple visualization of Huzhou call data. Since the data volume is assumed too large to render a heatmap directly in the browser, the rendering step is moved offline: the data is processed in parallel with Apache Spark, the heatmap is then also drawn with Apache Spark, and leaflet.js loads an OpenStreetMap layer plus the heatmap layer for good interactivity. In the current implementation the parallel rendering is slower than a single machine, possibly because Apache Spark is not well suited to this kind of computation or because the algorithm is poorly designed. The Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git.
fastdata-cluster - Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN, and HDFS with Vagrant and VirtualBox).
ros hadoop - a Hadoop splittable InputFormat for ROS; process rosbag files with Hadoop, Spark, and other HDFS-compatible systems.
fsbrowser - a fast desktop client for the Hadoop Distributed File System.
wasp - WASP is a framework for building complex real-time big data applications. It relies on a Kappa/Lambda-style architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.
skein - a tool and library for easily deploying applications on Apache YARN.
starlake - a Spark-based on-premise and cloud ELT/ETL framework for batch and stream processing.
ucz-dfs - a distributed file system written in Rust.