Solving Multimodal Problems Using Deep Learning: Speaker Identification
Abstract
Speaker identification refers to the task of locating the face of the person in a video whose identity matches the voice. It is a challenging perception task that integrates both visual and auditory signals. Long Short-Term Memory (LSTM) is a natural choice for solving such a multimodal sequence learning task in a unified model. One of the most straightforward ways to handle data from different domains is to integrate them into a single network by directly concatenating the inputs into a larger sequence; in this case, however, the multimodal nature of the inputs is completely ignored. Another solution is to treat the data from each domain independently: multiple LSTMs run in parallel, and a voting mechanism merges their output labels at the highest layer. The advantage of this approach is that a separate memory unit can be trained explicitly for each domain to store the information useful to it, but its weakness is that the modalities interact only at the highest level, during labeling. A better solution might be a multimodal LSTM architecture that unifies the visual and auditory modalities from the beginning of each input sequence and extends the conventional LSTM by sharing weights not only across time steps but also across modalities. We compare these models and the assumptions behind them in order to determine which is best suited to multimodal problems such as speaker identification.
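The three fusion strategies contrasted above can be made concrete with a minimal sketch. The code below is illustrative only (PyTorch assumed; the module names, feature dimensions, and the weight-sharing simplification are hypothetical and not taken from a specific implementation): an early-fusion LSTM that concatenates per-frame face and audio features, a late-fusion pair of modality-specific LSTMs whose scores are merged by averaging, and a shared-weight variant in which one LSTM cell processes both modality streams, so its recurrent weights are shared across modalities as well as time steps.

```python
# Sketch of three multimodal fusion strategies for speaker identification.
# All sizes below are hypothetical placeholders, not values from the paper.
import torch
import torch.nn as nn

FACE_DIM, AUDIO_DIM, HIDDEN, N_FACES = 256, 128, 128, 5


class EarlyFusionLSTM(nn.Module):
    """Concatenate face and audio features at every time step; one LSTM."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FACE_DIM + AUDIO_DIM, HIDDEN, batch_first=True)
        self.cls = nn.Linear(HIDDEN, N_FACES)

    def forward(self, face, audio):  # face: (B, T, FACE_DIM), audio: (B, T, AUDIO_DIM)
        h, _ = self.lstm(torch.cat([face, audio], dim=-1))
        return self.cls(h)           # per-step scores over candidate faces


class LateFusionLSTM(nn.Module):
    """Independent LSTMs per modality; outputs are merged only at the top."""
    def __init__(self):
        super().__init__()
        self.face_lstm = nn.LSTM(FACE_DIM, HIDDEN, batch_first=True)
        self.audio_lstm = nn.LSTM(AUDIO_DIM, HIDDEN, batch_first=True)
        self.face_cls = nn.Linear(HIDDEN, N_FACES)
        self.audio_cls = nn.Linear(HIDDEN, N_FACES)

    def forward(self, face, audio):
        hf, _ = self.face_lstm(face)
        ha, _ = self.audio_lstm(audio)
        # Simple "voting": average the two modality-specific score streams.
        return 0.5 * (self.face_cls(hf) + self.audio_cls(ha))


class SharedWeightLSTM(nn.Module):
    """Both modality streams pass through the same LSTM cell, so recurrent
    weights are shared across modalities as well as time steps (a
    simplification of the multimodal LSTM described in the abstract)."""
    def __init__(self):
        super().__init__()
        self.face_proj = nn.Linear(FACE_DIM, HIDDEN)    # map each modality to
        self.audio_proj = nn.Linear(AUDIO_DIM, HIDDEN)  # a common input space
        self.cell = nn.LSTMCell(HIDDEN, HIDDEN)         # one cell, shared weights
        self.cls = nn.Linear(HIDDEN, N_FACES)

    def forward(self, face, audio):
        B, T, _ = face.shape
        hf = cf = ha = ca = torch.zeros(B, HIDDEN, device=face.device)
        scores = []
        for t in range(T):
            # The same cell parameters process both modalities at every step,
            # while each modality keeps its own hidden and cell state.
            hf, cf = self.cell(self.face_proj(face[:, t]), (hf, cf))
            ha, ca = self.cell(self.audio_proj(audio[:, t]), (ha, ca))
            scores.append(self.cls(hf + ha))
        return torch.stack(scores, dim=1)


if __name__ == "__main__":
    face = torch.randn(2, 10, FACE_DIM)    # 2 clips, 10 frames of face features
    audio = torch.randn(2, 10, AUDIO_DIM)  # time-aligned audio features
    for model in (EarlyFusionLSTM(), LateFusionLSTM(), SharedWeightLSTM()):
        print(type(model).__name__, model(face, audio).shape)  # (2, 10, 5) each
```

Because all three sketches emit per-frame scores of the same shape, they can be trained and evaluated under a common loop, which is what makes the comparison of fusion strategies straightforward.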