GitHub Trends

#java #bigquery #database #dbt #delta_lake #elt #etl #hadoop #hive #hudi #iceberg #lakehouse #olap #query_engine #real_time #redshift #snowflake #spark #sql

Apache Doris is a high-performance, real-time analytical database that offers several benefits. It is easy to use with a simple architecture and supports standard SQL, making it compatible with MySQL tools. Doris delivers extremely fast query performance, even under massive data loads, making it ideal for scenarios like report analysis, ad-hoc queries, unified data warehouses, and data lake queries. It also supports federated querying of various data sources and has rich ecosystem integrations with tools like Spark and Flink. This makes Apache Doris a versatile and powerful tool for handling complex analytical tasks efficiently.

https://github.com/apache/doris


#python #analytics #dagster #data_engineering #data_integration #data_orchestrator #data_pipelines #data_science #etl #metadata #mlops #orchestration #scheduler #workflow #workflow_automation

Dagster is a tool that helps you manage and automate your data workflows. You can define your data assets, like tables or machine learning models, using Python functions. Dagster then runs these functions at the right time and keeps your data up-to-date. It offers features like integrated lineage and observability, making it easier to track and manage your data. This tool is useful for every stage of data development, from local testing to production, and it integrates well with other popular data tools. Using Dagster, you can build reusable components, spot data quality issues early, and scale your data pipelines efficiently. This makes your work more productive and helps maintain control over complex data systems.
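
To make the "assets as Python functions" idea concrete, here is a minimal standalone sketch. It is not Dagster's API: the `asset` decorator, `ASSETS` registry, and `materialize` helper below are hypothetical stand-ins, and the sample data is made up. It only illustrates how an orchestrator can read dependencies from function parameters and compute upstream assets first.

```python
import inspect

# Toy registry. This is NOT Dagster's API, only the idea behind
# defining data assets as plain Python functions.
ASSETS = {}


def asset(fn):
    """Register a function as a named data asset (hypothetical decorator)."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Made-up sample rows standing in for a source table.
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]


@asset
def order_total(raw_orders):
    # The parameter name declares a dependency on the raw_orders asset.
    return sum(row["amount"] for row in raw_orders)


def materialize(name, cache=None):
    """Compute an asset after computing the upstream assets it names."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps = inspect.signature(ASSETS[name]).parameters
        inputs = {dep: materialize(dep, cache) for dep in deps}
        cache[name] = ASSETS[name](**inputs)
    return cache[name]
```

Calling `materialize("order_total")` first computes `raw_orders`, then passes its result downstream.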

https://github.com/dagster-io/dagster


#python #airflow #apache #apache_airflow #automation #dag #data_engineering #data_integration #data_orchestrator #data_pipelines #data_science #elt #etl #machine_learning #mlops #orchestration #scheduler #workflow #workflow_engine #workflow_orchestration

Apache Airflow is a tool that helps you manage and automate workflows. You write your workflows as code, which makes them easier to maintain, version, test, and collaborate on. Airflow lets you schedule tasks and monitor their progress through a web interface. It supports dynamic pipeline generation and is extensible and scalable: you can define your own operators and executors.

Using Airflow benefits you by making your workflows more organized, efficient, and reliable. It simplifies the process of managing complex tasks and provides clear visualizations of your workflow's performance, helping you identify and troubleshoot issues quickly. This makes it easier to manage data processing and other automated tasks effectively.
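
As a rough illustration of "workflows as code", the sketch below runs three tasks in dependency order using only the Python standard library. It does not use Airflow's API: the `DAG` dictionary, the task functions, and `run` are hypothetical names chosen for this example.

```python
from graphlib import TopologicalSorter


def extract(upstream):
    # Made-up input rows; a real task would read from a source system.
    return [3, 1, 2]


def transform(upstream):
    return sorted(upstream["extract"])


def load(upstream):
    return f"loaded {len(upstream['transform'])} rows"


# Task name -> (callable, names of tasks that must finish first).
DAG = {
    "extract": (extract, set()),
    "transform": (transform, {"extract"}),
    "load": (load, {"transform"}),
}


def run(dag):
    """Execute every task in an order that respects its dependencies."""
    graph = {name: deps for name, (_, deps) in dag.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = dag[name]
        # Each task receives only the results of its declared upstream tasks.
        results[name] = fn({dep: results[dep] for dep in deps})
    return results
```

Because the pipeline is ordinary code, it can be versioned and unit tested like any other module, which is the point the post makes about Airflow.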

https://github.com/apache/airflow


#python #etl_pipeline #llm_platform #unstructured_data

Unstract is a tool that helps you extract data from unstructured documents using large language models (LLMs). It has a no-code platform where you can easily develop and test prompts to get the data you need. Here’s how it benefits you:
- **No-Code Extraction**: You can automate the extraction of data from complex documents without needing to write code.
- **Prompt Studio**: You can develop and test your prompts, then set up workflows in three simple steps to deploy APIs or ETL pipelines, automating critical business processes.
- **Integration with Various Tools**: Unstract supports multiple LLM providers, vector databases, embedding models, and text extractors, making it versatile and compatible with many systems.

Overall, Unstract saves time and effort by simplifying the process of extracting valuable data from unstructured documents.

https://github.com/Zipstack/unstract


#java #batch #cdc #change_data_capture #data_integration #data_pipeline #distributed #elt #etl #flink #kafka #mysql #paimon #postgresql #real_time #schema_evolution

Flink CDC is a tool that helps you move and transform data in real-time or in batches. It makes data integration simple by using YAML files to describe how data should be moved and transformed. This tool offers features like full database synchronization, table sharding, schema evolution, and data transformation. To use it, you need to set up an Apache Flink cluster, download Flink CDC, create a YAML file to define your data sources and sinks, and then run the job. This benefits you by making it easier to manage and integrate your data efficiently across different databases.
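
For readers new to change data capture, the sketch below shows the core idea of replaying change events from a source onto a sink table. It is not how Flink CDC is implemented or configured (Flink CDC jobs are described in YAML, as noted above); the event format with `op`, `key`, and `row` fields and the `apply_change` and `sync` helpers are assumptions made for illustration.

```python
def apply_change(table, event):
    """Apply one change event to a sink table keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upsert: merge new column values into the existing row, if any.
        # A column that appears for the first time is simply added,
        # which is a very rough stand-in for schema evolution.
        table[key] = {**table.get(key, {}), **event["row"]}
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return table


def sync(events):
    """Replay a stream of change events into a fresh sink table."""
    table = {}
    for event in events:
        apply_change(table, event)
    return table
```

Replaying an insert, a later update, and a delete for the same keys leaves the sink holding only the latest surviving rows, which is what keeps source and target aligned.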

https://github.com/apache/flink-cdc


#rust #ai #change_data_capture #context_engineering #data #data_engineering #data_indexing #data_infrastructure #data_processing #etl #hacktoberfest #help_wanted #indexing #knowledge_graph #llm #pipeline #python #rag #real_time #semantic_search

**CocoIndex** is a fast, open-source Python tool (Rust core) for transforming data into AI-ready formats such as vector indexes or knowledge graphs. Define simple data flows in ~100 lines of code using plug-and-play blocks for sources, embeddings, and targets. To get started, install it via `pip install cocoindex`, add Postgres, and run. It keeps targets in sync with fresh data, recomputing only what changed and tracking lineage. **You save time building scalable RAG and semantic search pipelines, avoiding complex ETL and stale data in production AI apps.**
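
The "minimal recompute on changes" idea can be sketched with content hashes. The code below is a standalone toy, not CocoIndex's API: the `IncrementalIndex` class and its `transform` argument (standing in for an expensive step such as computing embeddings) are hypothetical.

```python
import hashlib


class IncrementalIndex:
    """Re-run a transform only for documents whose content changed."""

    def __init__(self, transform):
        self.transform = transform  # stand-in for an expensive step
        self.hashes = {}            # doc id -> content hash seen last sync
        self.outputs = {}           # doc id -> transformed output

    def sync(self, docs):
        """Bring outputs in line with docs; return the ids recomputed."""
        recomputed = []
        for doc_id, text in docs.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.hashes.get(doc_id) != digest:
                self.outputs[doc_id] = self.transform(text)
                self.hashes[doc_id] = digest
                recomputed.append(doc_id)
        # Drop outputs for documents that no longer exist in the source.
        for doc_id in set(self.hashes) - set(docs):
            del self.hashes[doc_id]
            del self.outputs[doc_id]
        return recomputed
```

On a second `sync` call, unchanged documents are skipped entirely, so the cost scales with the size of the change and not with the size of the corpus.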

https://github.com/cocoindex-io/cocoindex
