#python #onnx #pytorch #voice_activity_detection #voice_commands #voice_control #voice_detection #voice_recognition
Silero VAD is a pre-trained, enterprise-grade Voice Activity Detector: it flags which segments of an audio stream contain speech.
https://github.com/snakers4/silero-vad
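A minimal usage sketch (not from the repo itself), assuming the torch.hub entry point the project exposes; the audio path and 16 kHz sampling rate are placeholders:
```python
import torch

# Load the pre-trained VAD model plus its helper utilities via torch.hub.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# "speech.wav" is a placeholder; read_audio resamples to the requested rate.
wav = read_audio("speech.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} offsets in samples
```
For streaming use, the same `utils` tuple provides `VADIterator`, which consumes fixed-size chunks incrementally.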
#python #agents #ai #multimodal #real_time #video #voice #voice_assistant
The Agents framework helps you build AI-driven programs that can interact with users in real-time through text, audio, images, or video. It integrates with OpenAI's Realtime API for ultra-low latency interactions and supports various plugins for speech-to-text, text-to-speech, and other AI services. You can use it to create voice assistants, transcription agents, and more, with easy deployment across local, self-hosted, or cloud environments. This makes it easier to develop interactive AI applications quickly and efficiently.
https://github.com/livekit/agents
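A minimal voice-agent sketch following the framework's quickstart pattern; the class names and the OpenAI Realtime plugin shown here are assumptions that may shift between releases of `livekit-agents`:
```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai  # requires livekit-plugins-openai

async def entrypoint(ctx: agents.JobContext):
    # Join the LiveKit room this job was dispatched for.
    await ctx.connect()

    # One session per conversation; the Realtime model covers STT+LLM+TTS end to end.
    session = AgentSession(llm=openai.realtime.RealtimeModel())
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a friendly, concise voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
The worker is then launched through the CLI the framework wires in (e.g. a local `dev` run), with LiveKit and OpenAI credentials set in the environment.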
#python #asr #audio #audio_processing #deep_learning #huggingface #language_model #pytorch #speaker_diarization #speaker_recognition #speaker_verification #speech_enhancement #speech_processing #speech_recognition #speech_separation #speech_to_text #speech_toolkit #speechrecognition #spoken_language_understanding #transformers #voice_recognition
SpeechBrain is an open-source toolkit that helps you quickly develop Conversational AI technologies, such as speech assistants, chatbots, and language models. It uses PyTorch and offers many pre-trained models and tutorials to make it easy to get started. You can train models for various tasks like speech recognition, speaker recognition, and text processing with just a few lines of code. SpeechBrain also supports GPU training, dynamic batching, and integration with HuggingFace models, making it powerful and efficient. This toolkit is beneficial because it simplifies the development process, provides extensive documentation and tutorials, and is highly customizable, making it ideal for research, prototyping, and educational purposes.
https://github.com/speechbrain/speechbrain
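To illustrate the "few lines of code" claim, a hedged transcription sketch assuming one of SpeechBrain's pre-trained LibriSpeech ASR models on HuggingFace (in releases before 1.0 the import path is `speechbrain.pretrained` instead of `speechbrain.inference`):
```python
from speechbrain.inference.ASR import EncoderDecoderASR

# Downloads the pre-trained CRDNN+RNNLM LibriSpeech model on first use.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

# "my_recording.wav" is a placeholder for your own audio file.
print(asr.transcribe_file("my_recording.wav"))
```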
#python #audio_generation #audio_synthesis #audioldm #audit #fastspeech2 #hifi_gan #music_generation #naturalspeech2 #singing_voice_conversion #speech_synthesis #text_to_audio #text_to_speech #vall_e #vits #voice_conversion
Amphion is a toolkit for generating audio, music, and speech. It helps researchers and engineers, especially beginners, by providing tools for various tasks like turning text into speech (TTS), singing voice conversion (SVC), and text to audio (TTA). Amphion includes visualizations to help understand how these models work, which is very useful for learning. It also offers different vocoders to produce high-quality audio and evaluation metrics to ensure the generated audio is good. This toolkit is free to use under the MIT License and can be installed easily using Python or Docker. Using Amphion, you can create high-quality audio and music with advanced features, making it a powerful tool for both research and practical applications.
https://github.com/open-mmlab/Amphion
#python #ai_translation #dubbing #localization #video_translation #voice_cloning
VideoLingo is a powerful tool that helps translate, localize, and dub videos, making them understandable across different languages. It uses advanced technologies like WhisperX for accurate subtitle recognition and GPT for high-quality translations. The tool ensures single-line subtitles, similar to those on Netflix, and offers dubbing alignment for a more natural viewing experience. You can use it online, in Google Colab, or install it locally on your computer. This makes it easier to share videos globally without language barriers, enhancing global knowledge sharing and communication.
https://github.com/Huanshere/VideoLingo
#python #agent #ai #asr #cpp #gemini #golang #gpt_4 #gpt_4o #llm #low_latency #multimodal #nextjs14 #openai #python #rag #real_time #realtime #tts #vision #voice_assistant
The TEN Agent is a powerful tool that helps you create and manage AI agents with various capabilities like real-time vision, screen detection, and integration with services like Google Gemini Multimodal Live API, Weather Check, and Web Search. To use it, you need to set up your environment with Docker, Node.js, and specific API keys. You can follow simple steps to configure and start your agent locally. The benefits include easy integration of advanced AI features, a supportive community through Discord and GitHub discussions, and the ability to customize and extend your agents with ready-to-use extensions. This makes it easier to develop and deploy sophisticated AI applications quickly.
https://github.com/TEN-framework/TEN-Agent
#python #deep_learning #glow_tts #hifigan #melgan #multi_speaker_tts #python #pytorch #speaker_encoder #speaker_encodings #speech #speech_synthesis #tacotron #text_to_speech #tts #tts_model #vocoder #voice_cloning #voice_conversion #voice_synthesis
The new version of Coqui.ai's TTS (Text-to-Speech) toolkit ships the ⓍTTSv2 model, which supports 16 languages and delivers better performance across the board. You can fine-tune the models using the provided code and examples, and stream audio with less than 200 ms latency, making it very responsive. You also get access to over 1,100 Fairseq models and features like voice cloning and voice conversion. The update also includes faster inference with the Tortoise model and support for multiple speakers and languages. These enhancements make it easier and more efficient to generate high-quality speech from text.
https://github.com/coqui-ai/TTS
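A hedged voice-cloning example through the `TTS.api` interface with the ⓍTTSv2 multilingual model; the reference-clip path is a placeholder for your own speaker sample:
```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads the multilingual ⓍTTSv2 checkpoint on first use.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello! This sentence was synthesized with a cloned voice.",
    speaker_wav="reference_voice.wav",  # short clip of the voice to clone (placeholder)
    language="en",
    file_path="output.wav",
)
```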
#python #audiobooks #chinese #docker #english #epub #gradio #linux #mac #multilingual #tts #voice_cloning #windows #xtts
This tool converts eBooks into audiobooks with chapters and metadata, supporting 1124 languages and optional voice cloning. Here’s how it benefits you:
- **Format support:** It converts eBooks in various formats (like `.epub`, `.pdf`, `.mobi`) into audiobooks with high-quality text-to-speech using tools like Calibre, ffmpeg, and XTTSv2.
- **Voice cloning:** You can clone your own voice or use default voices for the audiobook.
- **Flexible deployment:** You can run it on your local machine or use Docker for consistent results across different environments.
- **Free resources:** There are options to use free resources like Google Colab or rent a GPU for faster processing.
Make sure to use this tool responsibly with non-DRM, legally acquired eBooks.
https://github.com/DrewThomasson/ebook2audiobook
#python #text_to_speech #tts #vits #voice_clone #voice_cloneai #voice_cloning
GPT-SoVITS-WebUI is a powerful tool for converting text to speech and changing voices. Here’s what it offers:
- **Zero-shot TTS:** Convert text to speech instantly with just a 5-second vocal sample.
- **Few-shot TTS:** Train a good TTS model from as little as one minute of voice data (few-shot voice cloning).
- **Cross-lingual support:** It works in several languages including English, Japanese, Korean, Cantonese, and Chinese.
- **WebUI tools:** It includes tools like voice separation, automatic training-set segmentation, and text labeling, making it easier to create and use the models.
Using GPT-SoVITS-WebUI benefits you by allowing quick and easy voice conversion and text-to-speech with high quality and flexibility.
https://github.com/RVC-Boss/GPT-SoVITS
#python #singing_voice_conversion #voice_conversion
This tool helps you change voices in real-time or offline. It supports voice conversion, singing voice conversion, and can clone a voice with just 1-30 seconds of reference speech. You can use it for online meetings, gaming, or live streaming. The model is easy to fine-tune with custom data, requiring only one utterance per speaker. This makes it useful for creating personalized voice effects quickly and efficiently.
https://github.com/Plachtaa/seed-vc
#python #agentic_ai #agents #ai #autonomous_agents #deepseek_r1 #llm #llm_agents #voice_assistant
AgenticSeek is a free, fully local AI assistant that runs entirely on your own computer, ensuring your data stays private with no cloud or API use. It can autonomously browse the web, write and debug code in many languages, plan and execute complex tasks, and even respond to voice commands. It smartly chooses the best AI agent for each task, making it like having a personal team of experts. This local setup avoids monthly fees and protects your privacy while giving you powerful AI help for coding, research, and task management all on your device[1][2].
https://github.com/Fosowl/agenticSeek
#jupyter_notebook #android #asr #deep_learning #deep_neural_networks #deepspeech #google_speech_to_text #ios #kaldi #offline #privacy #python #raspberry_pi #speaker_identification #speaker_verification #speech_recognition #speech_to_text #speech_to_text_android #stt #voice_recognition #vosk
Vosk is a powerful tool for recognizing speech without needing the internet. It supports over 20 languages and dialects, making it useful for many different users. Vosk is small and efficient, allowing it to work on small devices like smartphones and Raspberry Pi. It can be used for things like chatbots, smart home devices, and creating subtitles for videos. This means users can have private and fast speech recognition anywhere, which is especially helpful when internet access is limited.
https://github.com/alphacep/vosk-api
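A hedged offline-transcription sketch, assuming a 16 kHz mono PCM WAV file and that Vosk is allowed to auto-download its small English model (a local model path works as well):
```python
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model(lang="en-us")           # or Model("path/to/unpacked/model")
wf = wave.open("speech.wav", "rb")    # placeholder path; must be mono 16-bit PCM
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Print each finalized utterance as it is recognized.
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])
```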
#python #audiobook #audiobooks #content_creation #content_creator #epub_converter #kokoro #kokoro_82m #kokoro_tts #media_generation #narrator #speech_synthesis #subtitles #text_to_audio #text_to_speech #tts #voice_synthesis
Abogen is a user-friendly tool that quickly converts ePub, PDF, or text files into natural-sounding audio with synchronized subtitles, perfect for creating audiobooks or voiceovers for social media and other projects. You can customize speech speed, choose or mix voices, generate subtitles by sentence or word, and select various audio and subtitle formats. It supports batch processing with queue mode and lets you save chapters separately or merged. Installation is straightforward on Windows, Mac, and Linux, with options for GPU acceleration. This saves you time and effort in producing high-quality audio content from text files efficiently.
https://github.com/denizsafak/abogen
#python #text_to_speech #tts #voice_clone #zero_shot_tts
OpenVoice is a free, open-source tool that lets you clone any voice using just a short audio sample, then generate speech in that voice across many languages and accents[1][5][8]. You can fine-tune how the voice sounds—adjusting emotion, accent, rhythm, pauses, and intonation—to match your needs[1][3][5]. A major benefit is “zero-shot” cloning: you can make the cloned voice speak languages it was never trained on, which is rare in voice AI[1][3][4]. The latest version, OpenVoice V2, offers even better sound quality, supports six major languages natively, and is free for both personal and commercial use[1]. This makes it easy and affordable for anyone to create realistic, customizable voice content without needing technical expertise or expensive software.
https://github.com/myshell-ai/OpenVoice
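A hedged sketch of the tone-color conversion step, modelled on the repo's demo notebooks; the checkpoint locations and audio file names are assumptions, and the source speech can come from any TTS or recording:
```python
import torch

from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint layout -- adjust to wherever the converter weights were downloaded.
converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Extract tone-colour embeddings for the source speech and the reference speaker clip.
source_se, _ = se_extractor.get_se("source_speech.wav", converter, vad=True)
target_se, _ = se_extractor.get_se("reference_speaker.wav", converter, vad=True)

# Re-render the source speech in the reference speaker's voice.
converter.convert(
    audio_src_path="source_speech.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="cloned_output.wav",
)
```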