Secure voice recognition in Telegram

Xoxo, Reznik here, and today we are going to recognize voices …

Let’s take a break from affiliate networks for a while and build a bot that helps with technical support. Its job is to transcribe voice and video messages in Telegram group chats. We will write it in Python, recognize speech with the free, open source vosk library, and restore punctuation with a model from Silero. The same mechanism is used in our AlterCPA Talk Bot.

The main advantage of such a bot is that speech recognition runs locally and does not depend on the Yandex or Google APIs. Recognition happens on your own server, which suits both the paranoid and those handling classified information. Even a novice developer or DevOps engineer can launch this bot without deep specialized knowledge.

Step 1. Preparing the bot in Telegram

First of all, we need to create a bot and get its token.

  1. Open Telegram and find @BotFather
  2. Send it the command /newbot to create a new bot
  3. Specify the name and the future address of our bot
  4. The bot is created; copy the token you receive and write it on a piece of paper
  5. Send the command /mybots to display the list of your bots
  6. Select the newly created bot in the list
  7. Press Bot Settings
  8. Press Allow groups and make sure that groups are currently enabled for the bot

The bot is ready and allowed to work in group chats, so it can automatically transcribe messages in groups.

If the bot later stays silent in a group, check Group Privacy in the same Bot Settings menu: with privacy mode turned on, a bot in a group only receives commands and replies addressed to it, unless it is an admin of that group.

Step 2. Prepare the server

As you may have guessed, we need a virtual server. My bots run on the EasyOne and Danny plans from Timeweb … The operating system is a clean Debian 10. In fact, any operating system will do, but step 5 may differ from what is described here. Installing the software takes about 1.5 GB of disk space, and the language models need another 2-2.5 GB.
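
If you want to check the disk before downloading anything, here is a minimal sketch using only the standard library. The 4 GB budget is my rounding of the numbers above, and the function names are made up for this example.

```python
import shutil

# Assumed budget: about 1.5 GB of software plus about 2.5 GB of models.
REQUIRED_GB = 4.0


def free_gb(path="/"):
    """Return the free space on the filesystem holding `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room(path="/", required_gb=REQUIRED_GB):
    """True if at least `required_gb` gigabytes are free at `path`."""
    return free_gb(path) >= required_gb
```

Once Python is installed, calling has_room("/home") tells you whether the partition holding /home can fit everything.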

Before starting work, be sure to update the system:

apt update && apt upgrade -y

Install Python and other necessary software:

apt install -y python3 python3-dev python3-pip unzip ffmpeg

Install the modules the bot needs: torch, vosk, PyTelegramBotAPI and their dependencies. The wave module is part of the Python standard library, so it does not need to be installed separately:

pip3 install PyTelegramBotAPI
pip3 install vosk
pip3 install numpy
pip3 install torch
pip3 cache purge

Please note that some modules are quite large, so the installation may take a long time.

Step 3. Download models for speech recognition

Models for recognition are stored locally. Let’s put them in the /home/ml folder. They need about 2-3 GB of free space. The available languages and recognition models are listed on the vosk website in the Models section. At the time of writing, the most recent model is vosk-model-en-0.22, so we will use it.

The punctuation model comes from the Silero GitHub repository, which contains the file models.yml. In it, find the te_models section and the package entry under latest. That is the link you are looking for; mine pointed to the file downloaded in the last command below.

It is best to run the commands one at a time – loading models can take a lot of time.

mkdir /home/ml
cd /home/ml
wget https://alphacephei.com/vosk/models/vosk-model-en-0.22.zip
unzip vosk-model-en-0.22.zip
rm -f vosk-model-en-0.22.zip
wget https://models.silero.ai/te_models/v2_4lang_q.pt

The models are downloaded and ready to go. Please note that by the time you read this article, the links may have changed. Use the most recent versions of the files.
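
Before moving on, it does not hurt to make sure both models really landed where the bot expects them. This is a rough sketch, not part of voxy.py; the function name is invented, and the paths are the ones from this step.

```python
from pathlib import Path

# Paths used in this article; adjust them if you store the models elsewhere.
VOSK_DIR = "/home/ml/vosk-model-en-0.22"
TE_FILE = "/home/ml/v2_4lang_q.pt"


def missing_models(vosk_dir=VOSK_DIR, te_file=TE_FILE):
    """Return a list of problems; an empty list means both models are in place."""
    problems = []
    if not Path(vosk_dir).is_dir():
        problems.append(f"vosk model folder not found: {vosk_dir}")
    if not Path(te_file).is_file():
        problems.append(f"punctuation model not found: {te_file}")
    return problems
```

An empty list from missing_models() means you can continue; anything else tells you which download to repeat.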

Step 4. Writing the bot

You can download the finished bot file voxy.py from our Gitlab repository. Just upload it to the server and enjoy; how to do that is described in the next step. In this step we will go through its code point by point.

At the beginning of the file, we specify the Python interpreter that will run the script. Yours may live at a different path; check it with which python3.

#!/usr/bin/python3
# coding: utf-8

Import all the modules installed in the second step.

import telebot
import pathlib
import requests
import subprocess
import os
import json
import wave
import torch
from vosk import Model, KaldiRecognizer

Set the token of our bot from the first step.

TOKEN = '12345:AAAA-BBBBBB_CCC'
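
Since we are being paranoid anyway, you may prefer not to keep the token in the file at all. Below is a possible sketch that reads it from an environment variable. The variable name VOXY_TOKEN and the function are my invention, and the colon check is only a sanity test based on how the sample token above looks.

```python
import os


def load_token(env_name="VOXY_TOKEN"):
    """Read the bot token from an environment variable (the name is hypothetical)."""
    token = os.environ.get(env_name, "").strip()
    # Tokens from BotFather look like "12345:AAAA-BBBBBB_CCC", so expect a colon.
    if ":" not in token:
        raise RuntimeError(f"Set {env_name} to the token received from BotFather")
    return token
```

With this in place, the line above could become TOKEN = load_token(), and the bot would be started as VOXY_TOKEN=... ./voxy.py. The rest of the article keeps the token in the file, as the original voxy.py does.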

Set the paths to the models and the language used. If you followed the instructions exactly, the paths need no changes.

MODEL = r"/home/ml/vosk-model-en-0.22"
TEMODEL = "/home/ml/v2_4lang_q.pt"
LANG = 'en'

Preparing the bot and models.

WORKDIR = str(pathlib.Path(__file__).parent.absolute())
bot = telebot.TeleBot( TOKEN )
model = Model( MODEL )
voska = KaldiRecognizer( model, 16000 )
tmodel = torch.package.PackageImporter( TEMODEL ).load_pickle( "te_model", "model" )

We will intercept all voice and video messages.

@bot.message_handler(content_types=["voice","video_note"])
def voice_decoder(message):

Next comes the body of the function, so don’t forget the indentation. The first step is to check the message type.

    if message.voice is not None:
        file = message.voice
    elif message.video_note is not None:
        file = message.video_note
    else:
        return False

Download the file attached to the message.

    try:
        finfo = bot.get_file(file.file_id)
        contents = requests.get(
            'https://api.telegram.org/file/bot{0}/{1}'.format(TOKEN, finfo.file_path),
            timeout=60
        )
        # treat HTTP errors as a failed download
        contents.raise_for_status()
    except Exception:
        return False

Let’s save the file right next to the bot. We found the path WORKDIR during initialization.

    downpath = WORKDIR + "/" + file.file_unique_id
    with open( downpath, 'wb' ) as dest:
        dest.write(contents.content)

Let’s convert the file with a magic command, which we will show below.

    path = audioconvert( downpath )
    if ( path == False ):
        return False

Let’s convert the file to text by simply calling the recognition model. If everything worked out, we pass the text through the punctuation model.

    text = speech2text( path )
    os.remove( path )
    # nothing but spaces and line breaks means nothing was recognized
    if ( not text or not text.strip() ):
        return False
    else:
        text = tmodel.enhance_text( text, LANG )

We send the text as a reply to the original message. This is where the recognition work ends.

    bot.reply_to(message, text)
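
One thing to keep in mind: Telegram limits a text message to 4096 characters, so a very long monologue may not fit into a single reply. Here is a small sketch of how the text could be cut into pieces. The helper is hypothetical and is not part of voxy.py.

```python
TELEGRAM_LIMIT = 4096  # maximum length of a Telegram text message


def split_text(text, limit=TELEGRAM_LIMIT):
    """Split `text` into pieces no longer than `limit`, cutting at spaces if possible."""
    pieces = []
    text = text.strip()
    while len(text) > limit:
        # Look for the last space that still keeps the piece within the limit.
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space found, cut in the middle of the word
        pieces.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        pieces.append(text)
    return pieces
```

In the handler, the single reply could then turn into a loop that calls bot.reply_to(message, piece) for every piece returned by split_text(text).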

We will use FFmpeg to convert audio, because I don’t know any other options. We need a WAV file with a 16 kHz sample rate, 16-bit PCM, mono.

def audioconvert(path):
    out_path = path + ".wav"
    command = [
        r'/usr/bin/ffmpeg',
        '-y',                    # overwrite the output file without asking
        '-i', path,
        '-acodec', 'pcm_s16le',  # 16-bit PCM
        '-ac', '1',              # mono
        '-ar', '16000',          # 16 kHz sample rate
        out_path
    ]
    result = subprocess.run(command, stdin=subprocess.DEVNULL, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL )
    os.remove( path )
    if ( result.returncode ):
        # ffmpeg may fail before it creates the output file
        if os.path.exists( out_path ):
            os.remove( out_path )
        return False
    else:
        return out_path
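
If recognition returns garbage, the first suspect is the audio format. The sketch below checks a converted file with the standard wave module; the function name is mine, and the expected values are the ones we asked FFmpeg for.

```python
import wave


def is_vosk_ready(path, rate=16000):
    """True if the WAV file at `path` is mono, 16-bit PCM at `rate` Hz."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == 1              # mono
            and wf.getsampwidth() == 2          # 2 bytes per sample = 16 bits
            and wf.getframerate() == rate       # sample rate in Hz
            and wf.getcomptype() == "NONE"      # uncompressed PCM
        )
```

Run it by hand on a file produced by the FFmpeg command; it should return True.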

To extract text with vosk, we will use ready-made magic from its developers. I have no idea how it works or why it looks the way it does. But it works.

def speech2text(path):
    result = ''
    last_n = False
    # the with block closes the file even if recognition fails
    with wave.open(path, "rb") as wf:
        while True:
            data = wf.readframes(16000)  # one second of audio at 16 kHz
            if len(data) == 0:
                break
            if voska.AcceptWaveform(data):
                # vosk has finished a phrase, take its text
                res = json.loads(voska.Result())
                if res['text'] != '':
                    result += f" {res['text']}"
                    last_n = False
                elif not last_n:
                    result += '\n'
                    last_n = True
    # collect whatever is left after the last chunk
    res = json.loads(voska.FinalResult())
    result += f" {res['text']}"
    return result
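
As far as I can tell, the magic boils down to this: the audio is fed in chunks, AcceptWaveform reports when a phrase has ended, Result returns that phrase as JSON, and FinalResult returns the leftover tail. The toy sketch below replays this loop with a fake recognizer so that it runs without vosk. FakeRecognizer and its "<pause>" marker are invented for illustration and have nothing to do with how the real KaldiRecognizer detects pauses.

```python
import json


class FakeRecognizer:
    """Stand-in for vosk's KaldiRecognizer, NOT the real class.

    Each chunk is treated as one already recognized word,
    and the chunk b"<pause>" ends the current phrase.
    """

    def __init__(self):
        self._words = []
        self._done = ""

    def AcceptWaveform(self, data):
        if data == b"<pause>":
            self._done = " ".join(self._words)
            self._words = []
            return True   # a phrase is complete
        self._words.append(data.decode())
        return False      # still in the middle of a phrase

    def Result(self):
        return json.dumps({"text": self._done})

    def FinalResult(self):
        tail, self._words = " ".join(self._words), []
        return json.dumps({"text": tail})


def collect_text(chunks, recognizer):
    """Same loop shape as speech2text, but over an in-memory list of chunks."""
    parts = []
    for data in chunks:
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result())["text"]
            if text:
                parts.append(text)
    tail = json.loads(recognizer.FinalResult())["text"]
    if tail:
        parts.append(tail)
    return " ".join(parts)
```

Feeding it [b"hello", b"world", b"<pause>", b"bye"] gives "hello world bye": the first two words arrive through Result, the last one through FinalResult.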

The last step is to start the bot so that it keeps polling Telegram for new messages.

if __name__ == '__main__':
    bot.infinity_polling()

Step 5. Launching the bot as a service

Let’s place the bot in the /home/bot folder and launch it there.

mkdir /home/bot
cd /home/bot
wget https://gitlab.com/altervision/altercpa-voxy/-/raw/main/voxy.py
chmod a+x voxy.py
nano voxy.py

Enter the bot token from step 1 and save (Ctrl + O, Enter, Ctrl + X). Then start the bot by hand to check that it works.

./voxy.py

Several loading messages will appear, and in a few seconds the bot will be ready. Send it a voice message and wait for a response. This step should produce no errors. If they do appear, pull your hair out, throw tantrums and write complaints to the White House.

Let’s create a unit file that describes our service. Let’s call the service voxybot.

nano /lib/systemd/system/voxybot.service

The content of the file will be something like this.

[Unit]
Description=VoxyBot

[Service]
Type=simple
Restart=on-failure
RestartSec=5s
ExecStart=/home/bot/voxy.py

[Install]
WantedBy=multi-user.target

Reload the systemd configuration, enable the service so that it starts at boot, and start it.

systemctl daemon-reload
systemctl enable voxybot.service
service voxybot start

Step 6. Checking the work

Your bot is running. Open a dialogue with it and send it a voice message. And then a video message. If there is no answer, you know what to do. Panic, denial, anger, bargaining, depression, acceptance of the inevitability of contacting tech support or learning Python. If the answer is text, just use it.

TL;DR: easy installation

After receiving the bot token, go to the server and run the following commands:

wget https://gitlab.com/altervision/altercpa-voxy/-/raw/main/setup-en.sh
bash setup-en.sh YOURBOTTOKEN

Replace YOURBOTTOKEN with the token received from BotFather. Installation can take a couple of hours, because the script needs to download about 3.5 GB of files. After installation, the bot will start working on its own.

Thanks y’all, Reznik out!