DEAR Translator – Progress Report #1

In this post, I will record my progress, my design choices and challenges, and the next steps of my Personal Project.

Problem Definition 

Many second-generation immigrants experience challenges when communicating with their grandparents and relatives as they often do not fluently speak a common language. For example, a grandson who speaks limited Chinese and grandparents who do not speak English may only be able to talk about simple topics, such as greetings or daily routines, and rarely engage in meaningful conversations. 

Existing Solutions

There are many products available with some form of speech translation, such as smartphone apps, earbuds, and portable devices. These products are broadly used for travelling, multilingual conferences, customer service, and education, and they are designed to support general situations.

My Idea

However, these products are mostly focused on simply translating across languages. My project will focus instead on enabling flowing, meaningful conversations across language barriers. Specifically, it will have two core features: a fast, streamlined translator and a personalized voice. It currently uses a three-model pipeline with an Automatic Speech Recognition (ASR) model, a Machine Translation (MT) model, and a Text-To-Speech (TTS) model.

My Progress 

Since my last post on Feb 19, I have made steady progress as planned. I finished the first stage (testing models) and the second stage (collecting data).

For the first stage, I put together a pipeline using three models and tested it on online data:

  • OpenAI’s Whisper – To convert from voice to text 
  • Facebook’s M2M100 – To translate text to different languages 
  • Qwen’s Qwen3 TTS – To speak the translated text aloud.
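The three stages above chain together one after another. Here is a minimal sketch of that flow in Python; the stub functions are hypothetical placeholders for the real model calls (Whisper, M2M100, and Qwen3 TTS), not actual APIs:

```python
# Minimal sketch of the three-model pipeline: ASR -> MT -> TTS.
# Each stub stands in for a real model and just passes data along.

def asr(audio: bytes) -> str:
    """Placeholder for Whisper: convert voice to text."""
    return "hello grandma"           # pretend transcription

def mt(text: str, src: str, tgt: str) -> str:
    """Placeholder for M2M100: translate text between languages."""
    return f"[{src}->{tgt}] {text}"  # pretend translation

def tts(text: str) -> bytes:
    """Placeholder for Qwen3 TTS: speak the translated text."""
    return text.encode("utf-8")      # pretend audio

def translate_speech(audio: bytes, src: str, tgt: str) -> bytes:
    """Run one utterance through all three stages in order."""
    text = asr(audio)
    translated = mt(text, src, tgt)
    return tts(translated)

out = translate_speech(b"fake-audio", "en", "zh")
print(out)  # b'[en->zh] hello grandma'
```

The key point is that each stage's output is the next stage's input, so any stage can be swapped out without touching the others.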

In the next stage, I used a scripted conversation between a grandson and his grandparents to test my pipeline. It consisted of 100 conversation-style sentences in both English and Chinese. This data is closer to how the project will actually be used, so it is more realistic than online data. I recorded the English copy and asked my mother to read the Chinese one. The pipeline performed well enough, so I moved on to the next stage.

I am still working on the third stage: implementing the core features. I have worked on a threading feature to parallelize the models; so far, it has only been implemented for one of the three models. I am nearly finished with the personalized voice feature: I have integrated voice cloning into the text-to-speech component, and it clones my voice well, even across languages. One of my tests used a few seconds of an English-only baseline recording and was able to imitate my voice speaking Chinese.
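The idea behind the threading feature can be sketched with Python's standard library: each stage runs in its own thread and passes utterances through queues, so while TTS is still speaking sentence 1, ASR can already be transcribing sentence 2. The stage names here are placeholders, not the real model code:

```python
# Sketch of pipelining the three stages with threads and queues.
# Each stage tags the item with its name instead of running a model.
import queue
import threading

def stage(name, inbox, outbox):
    """Pull items from inbox, 'process' them, push them to outbox."""
    while True:
        item = inbox.get()
        if item is None:             # sentinel: end of stream
            outbox.put(None)
            break
        outbox.put(f"{item}|{name}")

q_audio, q_text, q_translated, q_speech = (queue.Queue() for _ in range(4))

stages = [("asr", q_audio, q_text),
          ("mt", q_text, q_translated),
          ("tts", q_translated, q_speech)]
threads = [threading.Thread(target=stage, args=s) for s in stages]
for t in threads:
    t.start()

for utterance in ["utt1", "utt2", "utt3"]:
    q_audio.put(utterance)
q_audio.put(None)                    # signal that the stream is over

results = []
while (item := q_speech.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # ['utt1|asr|mt|tts', 'utt2|asr|mt|tts', 'utt3|asr|mt|tts']
```

Because the queues are FIFO, utterances come out in the order they went in, which matters for keeping a conversation coherent.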

My Spoken English (Baseline)

AI Generated Chinese

My Spoken Chinese (Comparison)

Note: The generated speech was made from ONLY the Text-To-Speech model, not the full pipeline. 

Challenges So Far 

During my testing, I found that my ASR model, Whisper, kept returning the same phrases over and over when transcribing Chinese audio. To fix this problem, I can either debug this repetition glitch or switch to a different ASR model.

Design Choices

I have decided to try a different ASR model. This choice also lets me try a different pipeline approach: rather than using a separate model for each step, I can use a single model that handles the first two steps. I have found two models so far that fit this: Meta’s SeamlessM4T and IBM’s Granite 4.0. Each has its own advantages, which can be viewed in “AI-Transcript-2.pdf” below. I will start with IBM’s Granite as it is faster, less memory-intensive, and usually makes fewer errors. One of its primary disadvantages is that it cannot translate between any two supported languages; it must translate from or to English. However, this is not a problem for now, as I am primarily using the English-Chinese pair.

Future Challenges 

A challenge I think I will face soon is speed. In my testing, the ASR model was quite fast, needing only 1 second of processing for every 5 seconds of audio. However, the TTS model has been very slow, taking upwards of 17 seconds to generate just 2 to 3 seconds of speech. Although I am planning to implement threading to parallelize the models, this cannot fix the core issue: a single TTS pass already takes 17 seconds, which will interrupt the flow of the conversation.
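A quick way to see why threading alone cannot save this is the real-time factor (RTF): processing time divided by audio duration. A stage with RTF below 1 keeps up with live speech; above 1, it falls behind no matter how the stages are overlapped. Using my measured numbers (and taking 2.5 s as the midpoint of the 2 to 3 second TTS output):

```python
# Back-of-envelope speed check using the real-time factor (RTF):
# processing seconds per second of audio. RTF < 1 keeps up with speech.

asr_rtf = 1 / 5      # 1 s of processing per 5 s of audio
tts_rtf = 17 / 2.5   # 17 s to generate ~2.5 s of speech (midpoint estimate)

print(f"ASR RTF: {asr_rtf:.1f}")  # 0.2 -> comfortably real-time
print(f"TTS RTF: {tts_rtf:.1f}")  # 6.8 -> far too slow

# Threading overlaps the stages, but the slowest stage still sets the
# pace: the whole pipeline can only stream if every stage has RTF < 1.
assert asr_rtf < 1 < tts_rtf
```

So the TTS stage alone needs roughly a 7x speedup (or a faster model) before the pipeline can feel conversational.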

Next Steps 

To finish my third stage, I first need to flesh out my implementation of voice cloning within the whole pipeline. I also need to finish the streamlined pipeline. After that, I should have a working pipeline that takes in a constant stream of audio and returns a constant stream of translated speech. I plan to finish this by April 13th. During spring break (March 14th to 28th), I will be busy with a band tour, so I really only have 2 weeks to finish these two steps. I think I am pretty close to finishing, so this should be enough time.

AI Transparency Statement 

I am also using AI to aid my programming process. In particular, I am using Claude to help debug errors and to find specific Python libraries or methods that are useful for my specific needs. For example, it showed me how to use the "decode_example" method of the "Audio" class to load audio from a file and cast it to a custom sampling rate.

In addition, I have also used Claude’s wide knowledge to compare AI models before I use them. 
