Speech Tools Minicourse, Summer 2020

Meeting 5: Transformer

This file (m5.ipynb) and the Python script generated from it (m5.py) are available via git clone https://github.com/uiuc-sst/minicourse; cd minicourse; git checkout 2020.

Today's outline

  1. UDHR Corpus
  2. A very simple Transformer-based speech recognizer
  3. Training the transformer
  4. Testing the simple speech recognizer
  5. What's next?

1. UDHR Corpus

The UDHR (Universal Declaration of Human Rights) is available from the United Nations in several hundred translations. Audio recordings are being collected and distributed in the public domain by LibriVox. We have started to segment those recordings, for use in multilingual speech technology.

To get it, first clone the repo:

git clone https://github.com/uiuc-sst/udhr
cd udhr

Then pip install some necessary packages, in case you don't already have them:

pip install pycountry praatio librosa h5py

The rest of the installation can be done from python, as follows:

In [1]:
import os,pycountry,librosa,h5py,torch
import matplotlib.pyplot as plt
from praatio import tgio
import numpy as np
UDHR_ROOT=os.path.expanduser('~/data/librivox/udhr')
os.chdir(UDHR_ROOT)

Now you can import some useful functions from the udhrpy subdirectory.

  • load_audio() downloads the audio, unzips it, converts it to WAV, and segments it. If you haven't already done those things, it will take a long time.
  • create_hdf5() converts the WAV files to melspectrograms and saves them in an hdf5 file (a rough sketch of this conversion appears after this list).
  • UDHR_Dataset() is actually a pytorch Dataset object, specialized for the UDHR corpus.
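If you're curious what create_hdf5 is doing under the hood, here is a rough standalone sketch of that kind of WAV-to-melspectrogram conversion, using librosa and h5py. The function name, sampling rate, number of mel bands, and hdf5 layout below are illustrative assumptions, not the actual udhrpy implementation.

# A rough sketch of the kind of conversion create_hdf5 performs.  The parameters
# here (16 kHz, 128 mel bands) and the hdf5 layout are assumptions for illustration,
# not necessarily the settings used by udhrpy.
import h5py, librosa

def wav_to_melspectrogram_sketch(wavfile, hdf5file, uttid, sr=16000, n_mels=128):
    y, sr = librosa.load(wavfile, sr=sr)                                 # read and resample the audio
    melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, nframes)
    with h5py.File(hdf5file, 'a') as f:                                  # append to the archive
        grp = f.require_group(uttid)
        grp.create_dataset('melspectrogram', data=melspec)
        grp.attrs['samprate'] = sr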
In [2]:
import udhrpy
udhrpy.load_audio()
udhrpy.create_hdf5('UDHR.hdf5')
dataset=udhrpy.UDHR_Dataset('UDHR.hdf5')
Segmenting ['human_rights_03_eng_cc_64kb', 'human_rights_03_eng_jkb_64kb', 'human_rights_03_eng_mdk_64kb', 'human_rights_03_epo_njb_64kb', 'human_rights_un_afk_cdb_64kb', 'human_rights_un_arz_ef_64kb', 'human_rights_un_bal_brc_64kb', 'human_rights_un_bug_brc_64kb', 'human_rights_un_bul_eu_64kb', 'human_rights_un_cat_nv_64kb', 'human_rights_un_chi_nf_64kb', 'human_rights_un_chn_cz_64kb'] from exp/wav to exp/audio
Creating 0'th melspectrogram: human_rights_un_chi_nf_64kb_0044
Creating 100'th melspectrogram: human_rights_un_chi_nf_64kb_0093
Creating 200'th melspectrogram: human_rights_un_afk_cdb_64kb_0030
Creating 300'th melspectrogram: human_rights_un_bul_eu_64kb_0060
Creating 400'th melspectrogram: human_rights_un_bul_eu_64kb_0013
Creating 500'th melspectrogram: human_rights_un_bal_brc_64kb_0020
Creating 600'th melspectrogram: human_rights_un_bal_brc_64kb_0009
Creating 700'th melspectrogram: human_rights_un_chi_nf_64kb_0007
Creating 800'th melspectrogram: human_rights_03_epo_njb_64kb_0053
Creating 900'th melspectrogram: human_rights_un_arz_ef_64kb_0004
Creating 1000'th melspectrogram: human_rights_un_bul_eu_64kb_0066
Creating 1100'th melspectrogram: human_rights_un_bal_brc_64kb_0096
Creating 1200'th melspectrogram: human_rights_03_epo_njb_64kb_0107

Now let's look at some of the melspectrograms.

In [5]:
tok = -3       # pick a token near the end of the corpus
print(dataset[tok].keys())
# Show at most 300 frames, on a log scale, with a small floor to avoid log(0)
nframes = min(300,dataset[tok]['melspectrogram'].shape[1])
sgmax=np.amax(dataset[tok]['melspectrogram'][:,0:nframes])
dbgram=np.log(dataset[tok]['melspectrogram'][:,0:nframes]+1e-6*sgmax)
# Truncate the phone and character transcriptions so they fit in the plot title
nphones=min(len(dataset[tok]['phones'][:]),27)
nchar=min(len(dataset[tok]['text'][:]),10)
# Convert index sequences back to symbols using the dataset's lookup tables
phones = ''.join(dataset.idx2phone[p] for p in dataset[tok]['phones'][0:nphones])
text = ''.join(dataset.idx2char[c] for c in dataset[tok]['text'][0:nchar])
languagename = dataset[tok]['languagename'][()]
uttid = dataset[tok]['uttid'][()]

fig, ax=plt.subplots(figsize=(14,8))
ax.imshow(dbgram,origin='lower')
ax.set_title('%s (%s): /%s/ (%s)'%(uttid,languagename,phones,text))
<KeysViewHDF5 ['iso639-3-iso3166-1', 'languagename', 'melspectrogram', 'nsamps', 'phones', 'samprate', 'text', 'uttid']>
Out[5]:
Text(0.5, 1.0, 'human_rights_un_chn_cz_64kb_0014 (Mandarin Chinese): /ji˥ntsʰɨ˨˩˦ɕjɛ˥˩ntsa˥˩jtɑ˥˩/ (因此现在,大会,发布)')

2. A very simple Transformer-based speech recognizer

The pytorch Transformer module already does almost everything we want: https://pytorch.org/docs/master/generated/torch.nn.Transformer.html#torch.nn.Transformer

In order to get this to train in a short time, we'll use a really small model (so the final performance won't be great). Let's try just one layer each for the encoder and the decoder, a 128-dimensional model, a 128-dimensional feedforward layer, and 4 attention heads.

In [6]:
transformer_params = {
    'd_model':128,
    'nhead':4,
    'num_encoder_layers':1,
    'num_decoder_layers':1,
    'dim_feedforward':128
}
transformer_model = torch.nn.Transformer(**transformer_params)

We also need to embed the IPA phone symbols into a 128-dimensional space at the decoder input, and map the 128-dimensional outputs back to phone scores at the output layer. Let's leave the number of distinct phones as an input parameter, called nphones.

When we run the forward pass, let's also provide a tgt_mask, to keep out[i] from attending to any tgt[j] with j>i.
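To make that masking convention concrete, here is a tiny standalone demonstration (not part of the recognizer itself) of the kind of additive causal mask we'll build in section 3: entries of -inf block attention, and entries of 0 allow it.

# Tiny demonstration of an additive causal mask for T=4 target positions.
# Row i is the mask applied when producing out[i]: the -inf entries above the
# main diagonal block attention to tgt[j] for j>i, and the zeros allow the rest.
T = 4
demo_mask = torch.triu(torch.ones(T,T)*(-np.inf), diagonal=1)
print(demo_mask)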

In [7]:
class MyNetwork(torch.nn.Module):
    def __init__(self, nphones, transformer_params):
        '''Phone embedding, transformer, output layer.
        nphones = number of distinct phone symbols to be learned.
        '''
        super(MyNetwork, self).__init__()
        self.phone_embedding = torch.nn.Embedding(nphones, 128)
        self.transformer_model = torch.nn.Transformer(**transformer_params)
        self.output_layer = torch.nn.Linear(128, nphones)
    def forward(self,src,tgt_indices,tgt_mask):
        '''Given melspectrogram and target indices,
        predict the next phone after each given phone.
        src: (S, N, 128) melspectrogram frames
        tgt_indices: (T, N) phone indices
        tgt_mask: (T, T) additive causal mask
        '''
        tgt = self.phone_embedding(tgt_indices)                             # (T, N, 128)
        out = self.transformer_model(src=src, tgt=tgt, tgt_mask=tgt_mask)   # (T, N, 128)
        scores = self.output_layer(out)                                     # (T, N, nphones)
        return(scores)
In [8]:
nphones = len(dataset.idx2phone)
model = MyNetwork(nphones, transformer_params)
print(model)
MyNetwork(
  (phone_embedding): Embedding(129, 128)
  (transformer_model): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=128, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=128, out_features=128, bias=True)
          (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=128, out_features=128, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=128, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=128, out_features=128, bias=True)
          (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
  )
  (output_layer): Linear(in_features=128, out_features=129, bias=True)
)

Pytorch has no built-in positional encoding, so let's create one:

In [9]:
def positional_encodings(nframes, d_model):
    # Sinusoidal positional encodings: sines in the first d_model/2 columns,
    # cosines in the last d_model/2, at geometrically spaced frequencies.
    times = np.arange(nframes,dtype=float)
    frequencies = np.power(10000.0,-2*np.arange(d_model/2)/d_model)
    phases = torch.tensor(data=np.outer(times,frequencies), dtype=torch.float)
    return(torch.cat((torch.sin(phases),torch.cos(phases)), dim=1))
print('encodings for a 5x4:')
print(positional_encodings(5,4))
encodings for a 5x4:
tensor([[ 0.0000,  0.0000,  1.0000,  1.0000],
        [ 0.8415,  0.0100,  0.5403,  0.9999],
        [ 0.9093,  0.0200, -0.4161,  0.9998],
        [ 0.1411,  0.0300, -0.9900,  0.9996],
        [-0.7568,  0.0400, -0.6536,  0.9992]])

3. Training the transformer

If we combine files with different lengths into the same forward pass, then we'll need to zero-pad them all to the same sequence length (S for src, T for tgt), and then use the padding masks (src_key_padding_mask and tgt_key_padding_mask) to keep attention away from the dummy frames; a sketch of that approach appears just before the next code cell. Instead, let's use a minibatch size of N=1.

That means we just need to unsqueeze the melspectrogram and phones, before running the model, to match the documentation for torch.nn.Transformer:

  • src (melspectrogram) should be size (S,N=1,128)
  • tgt (phone embedding) should be (T,N=1,128), therefore
  • phones should be (T,N=1).

So all of those are unsqueezed on dimension 1 (N). We also need to be careful to correctly time-align the transformer tgt and the reference -- we construct tgt from a silence symbol followed by all phones but the last one ([silence] + phones[:-1]), so that the output at each position predicts the phone that follows its input, and we score against the full phone sequence as the reference. For example, if the phone sequence is [a, b, c], then tgt = [silence, a, b] and the reference is [a, b, c].

Finally, we design tgt_mask so it adds -np.inf to any e[i,j] with j>i. We can do this by creating an all-ones matrix, multiplying by -np.inf, then keeping only the part strictly above the main diagonal (torch.triu with diagonal=1), which zeros out everything else.
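As an aside (not used in this notebook), here is a rough sketch of what the padded-batch alternative mentioned above could look like on the encoder side, using torch.nn.utils.rnn.pad_sequence and src_key_padding_mask. The two sequence lengths and the random inputs are made up for illustration.

# Hypothetical sketch of batching two utterances of different lengths (N=2).
# pad_sequence zero-pads them to a common length S_max, and src_key_padding_mask
# (True = padded frame, to be ignored) tells the encoder which frames are dummies.
from torch.nn.utils.rnn import pad_sequence

sgrams = [torch.randn(37,128), torch.randn(52,128)]       # two fake (S_i, d_model) inputs
src = pad_sequence(sgrams)                                # (S_max, N=2, 128)
lengths = torch.tensor([s.size(0) for s in sgrams])
src_key_padding_mask = torch.arange(src.size(0))[None,:] >= lengths[:,None]   # (N, S_max), True where padded
memory = transformer_model.encoder(src, src_key_padding_mask=src_key_padding_mask)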

In [10]:
# In the UDHR corpus, the space symbol is used between words.  We'll use it for silence.
silence_class = dataset.idx2phone.find(' ')  

def prepare_token(dataset, i):
    '''Prepare data from group=dataset[i] for some index i'''
    # Transpose the sgram, add positional encoding, unsqueeze it
    sgram = torch.tensor(dataset[i]['melspectrogram']).T
    S = sgram.size()[0]
    src = torch.unsqueeze( sgram + 1e-6*positional_encodings(S,128), 1)
    
    # tgt_indices = silence, then all but the last phone
    ref_indices = torch.tensor(dataset[i]['phones'])
    tgt_indices = torch.tensor(data=[silence_class]+list(ref_indices[:-1]))
    tgt_indices = torch.unsqueeze(tgt_indices,1)
    T = tgt_indices.size()[0]
    
    # tgt_mask prevents each phone from attending to any future phone
    tgt_mask = torch.triu(torch.ones(T,T)*(-np.inf),diagonal=1)
    
    return(src, tgt_indices, tgt_mask, ref_indices)
In [11]:
def training_step(model,dataset,i,lossfunc,stepnum,optimizer):
    optimizer.zero_grad()
    (src, tgt_indices, tgt_mask, ref_indices) = prepare_token(dataset,i)
    if ref_indices.size()[0]<1:
        print('Zero-length tok: %d of epoch %d, %s'%(i,stepnum,dataset[i]['uttid'][()]))
        return(0)
    scores = torch.squeeze(model(src,tgt_indices,tgt_mask),1)
    L = lossfunc(scores,ref_indices)
    L.backward()
    optimizer.step()
    return(L.item())  # return the loss score
In [12]:
lossfunc = torch.nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(model.parameters(),betas=(0.9,0.98),eps=1e-9,lr=0.01)
In [236]:
avg_loss=training_step(model,dataset,0,lossfunc,0,optimizer)
for stepnum in range(2):
    for i in range(len(dataset)):
        loss=training_step(model,dataset,i,lossfunc,stepnum,optimizer)
        if loss > 0:
            avg_loss = 0.95*avg_loss + 0.05*loss
        if i%100==0:
            print('Epoch %d token %d: loss %g'%(stepnum,i,avg_loss))
Epoch 0 token 0: loss 3.99487
Zero-length tok: 33 of epoch 0, human_rights_un_bug_brc_64kb_0009
Epoch 0 token 100: loss 2.91713
Epoch 0 token 200: loss 2.81115
Epoch 0 token 300: loss 2.93242
Epoch 0 token 400: loss 3.19977
Epoch 0 token 500: loss 3.57512
Epoch 0 token 600: loss 3.40984
Epoch 0 token 700: loss 3.3914
Epoch 0 token 800: loss 3.41766
Epoch 0 token 900: loss 3.3893
Epoch 0 token 1000: loss 3.33244
Epoch 0 token 1100: loss 3.33524
Epoch 0 token 1200: loss 3.35514
Epoch 1 token 0: loss 3.31072
Zero-length tok: 33 of epoch 1, human_rights_un_bug_brc_64kb_0009
Epoch 1 token 100: loss 3.06419
Epoch 1 token 200: loss 2.87602
Epoch 1 token 300: loss 2.9462
Epoch 1 token 400: loss 3.12043
Epoch 1 token 500: loss 3.49335
Epoch 1 token 600: loss 3.36815
Epoch 1 token 700: loss 3.33558
Epoch 1 token 800: loss 3.39144
Epoch 1 token 900: loss 3.35435
Epoch 1 token 1000: loss 3.27673
Epoch 1 token 1100: loss 3.33043
Epoch 1 token 1200: loss 3.33977

4. Testing the simple speech recognizer

Now, we can test it by calling the model on a sample, and interpreting the output.

In [249]:
tok = 1
(src, tgt_indices, tgt_mask, ref_indices) = prepare_token(dataset,tok)
languagename = dataset[tok]['languagename'][()]
uttid = dataset[tok]['uttid'][()]
print('Token %d %s (%s)'%(tok,uttid,languagename))
nchar = min(len(dataset[tok]['text'][:]),10)   # show just the first few characters
text = ''.join(dataset.idx2char[c] for c in dataset[tok]['text'][0:nchar])
print(text)
phones = ''.join(dataset.idx2phone[p] for p in dataset[tok]['phones'][:])
print('Correct transcription: '+phones)

# Teacher-forced test: the reference phone sequence is fed in as tgt
scores = torch.squeeze(model(src,tgt_indices,tgt_mask), 1)
best_idx = torch.argmax(scores, 1)
out = ''.join(dataset.idx2phone[p] for p in best_idx)
print('The ASR thought it heard: '+out)
Token 1 human_rights_03_eng_cc_64kb_0010 (English - United States)
Now, there
Correct transcription: #ˈaʊ# #ðˈɛɹ.fˌɔɹ#
The ASR thought it heard: ######## ###### #

At this point, the ASR seems to have learned which one or two symbols are most frequent in each of the languages in the corpus (and it seems to have learned to recognize which language is being spoken!), but it hasn't yet learned to recognize any phonemes.
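Note that the test above is teacher-forced: the reference phone sequence is fed in as tgt, so the model only has to predict one step ahead. For a genuine recognition test we would decode autoregressively. Here is a minimal greedy-decoding sketch (not part of the pipeline above); it reuses the model, dataset, prepare_token, and silence_class defined earlier, and simply caps the hypothesis length, since this setup has no end-of-utterance symbol.

# A minimal greedy autoregressive decoding sketch.  Starting from the silence
# symbol, repeatedly feed the phones predicted so far back in as tgt, and append
# the argmax phone at the last output position.
def greedy_decode(model, dataset, i, max_len=50):
    (src, _, _, _) = prepare_token(dataset, i)
    hyp = [silence_class]
    model.eval()                     # turn off dropout for decoding
    with torch.no_grad():
        for _ in range(max_len):     # no end-of-utterance symbol, so just cap the length
            tgt_indices = torch.tensor(hyp).unsqueeze(1)                    # (T,1)
            T = tgt_indices.size(0)
            tgt_mask = torch.triu(torch.ones(T,T)*(-np.inf), diagonal=1)
            scores = torch.squeeze(model(src, tgt_indices, tgt_mask), 1)    # (T,nphones)
            hyp.append(int(torch.argmax(scores[-1])))
    return ''.join(dataset.idx2phone[p] for p in hyp[1:])

print(greedy_decode(model, dataset, 1))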

5. What's next?

Here are some things that need to happen before this idea can become a useful speech recognizer:

  1. Better loss function. From a phonetics point of view, it doesn't really make sense to require the ASR to get exactly the right IPA symbol --- that's not how the symbols of the IPA are meant to be used. Instead, it should be scored on the number of articulatory features it gets right (a toy sketch of this idea appears after this list).
  2. Better phone transcriptions. The grapheme-to-phoneme models I'm using are from https://github.com/uiuc-sst/g2ps; they have not really been re-trained since 2017, and though some of them are quite good, some of them are quite awful. We need to at least figure out which ones are worth using; better still, I'd like to use transformers, trained on a whole lot of dictionaries from many languages, to build better G2Ps.
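To illustrate item 1, here is a toy sketch of feature-based scoring. The feature table below is a made-up fragment ([voiced, nasal, vocalic] for a handful of symbols), purely to show the idea of counting feature errors instead of symbol errors; a real version would use a full articulatory-feature inventory for the IPA.

# Toy sketch of articulatory-feature scoring.  The feature table is a made-up
# fragment ([voiced, nasal, vocalic]) for a few symbols, just to show the idea.
import numpy as np

feature_table = {
    'p': [0, 0, 0],
    'b': [1, 0, 0],
    'm': [1, 1, 0],
    'i': [1, 0, 1],
}

def feature_errors(hyp_phone, ref_phone):
    '''Count how many articulatory features differ between a hypothesized phone
    and the reference phone, instead of scoring the pair as simply right or wrong.'''
    h = np.array(feature_table[hyp_phone])
    r = np.array(feature_table[ref_phone])
    return int(np.sum(np.abs(h - r)))

print(feature_errors('b', 'p'))   # 1: substituting [b] for [p] misses only voicing
print(feature_errors('m', 'p'))   # 2: substituting [m] for [p] misses voicing and nasality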

It is also probably a good idea to use larger speech databases. Figuring out how to balance the amount of data per language vs. the number of languages vs. the complexity of the model vs. the training time is a hard problem.
