PhD in Cross-modal Deep Learning between Vision, Language, Audio and Speech


Would you like to join us? The call is open for the INPhINIT-“la Caixa” Doctoral Fellowship Programme under the BSC research center, which offers 20 INPhINIT PhD positions. INPhINIT is a doctoral fellowship programme devoted to attracting international Early-Stage Researchers to the top Spanish research centres. It is promoted by the “la Caixa” Foundation with the aim of supporting the best scientific talent and fostering innovative, high-quality research in Spain by recruiting outstanding international students and offering them an attractive and competitive environment for conducting research of excellence.

Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision, audio and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research that exploits the generalisation properties of deep learning. This project aims to explore deep neural network architectures capable of projecting any multimedia signal into a joint embedding space based on its shared semantics. In such a space, an image of a dog would be projected to a representation similar to that of an audio snippet of a dog barking, the natural language text “a dog barking”, or a speech excerpt of a human reading this sentence aloud. The joint multimedia embedding space will be used to address two basic applications that share a common core machine learning technology:

a) Cross-modal retrieval: Given a data sample characterized by one or more modalities, match it with the most similar data sample from other modalities in a catalogue. This would be the case, for example, of a visual search problem where a picture of a food dish is matched with the recipe followed to cook it, or vice versa.

b) Cross-modal generation: Given a data sample characterized by one or more modalities, generate a new signal in another modality that matches it. This would be the case of a lip-reading application, where a video stream of a person speaking is used to generate a text or speech signal that matches the spoken words. Another option would be generating an image or diagram from an oral description of it. Generative models will rely on training schemes such as Variational AutoEncoders (VAEs) and/or Generative Adversarial Networks (GANs).
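
As a rough illustration of the retrieval case in (a), the sketch below ranks catalogue items by cosine similarity to a query in a shared embedding space. It assumes the embeddings have already been produced by modality-specific encoders; the vectors, the item identifiers and the `retrieve` helper are all hypothetical and chosen only for illustration, and cosine similarity is one possible choice of similarity measure, not one prescribed by the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, catalogue, top_k=1):
    """Return the ids of the top_k catalogue items closest to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), item_id)
        for item_id, embedding in catalogue.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical 3-dimensional joint embeddings (made-up numbers).
catalogue = {
    "audio:dog_bark": [0.9, 0.1, 0.0],
    "text:a dog barking": [0.8, 0.2, 0.1],
    "text:paella recipe": [0.0, 0.1, 0.9],
}
image_of_dog = [1.0, 0.0, 0.0]  # stand-in for an image encoder's output

print(retrieve(image_of_dog, catalogue, top_k=2))
```

With these toy vectors, the dog image retrieves the barking audio and the matching text ahead of the unrelated recipe, which is the behaviour a well-trained joint embedding is expected to show.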

The research project will require skills in Python and in deep learning frameworks such as TensorFlow and/or PyTorch. The project may be redefined later; for now, the candidate is expected to design a deep neural network built from the basic blocks of convolutional and recurrent neural networks to process the diverse multimodal data considered in this project. Data loaders will need to be written to make efficient use of multiple GPUs. The candidate must demonstrate previous experience with such systems.
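
To give a flavour of the multi-GPU data loading mentioned above, here is a minimal, framework-free sketch of one common idea: each GPU worker reads a disjoint, interleaved shard of the dataset indices and groups them into batches. The function names and parameters are hypothetical; in practice a framework utility (for instance PyTorch's `DistributedSampler`) would normally handle this, along with shuffling and padding.

```python
def shard_indices(num_samples, num_workers, worker_rank):
    """Indices assigned to one worker: rank, rank + num_workers, ..."""
    if not 0 <= worker_rank < num_workers:
        raise ValueError("worker_rank must be in [0, num_workers)")
    return list(range(worker_rank, num_samples, num_workers))


def batches(indices, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


# Example: 10 samples split across 4 hypothetical GPU workers.
for rank in range(4):
    shard = shard_indices(num_samples=10, num_workers=4, worker_rank=rank)
    print(rank, list(batches(shard, batch_size=2)))
```

Interleaved sharding keeps the shards nearly equal in size and guarantees that no sample is read by two workers in the same pass.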

If you’re interested in working in this area and need more details, please contact the PhD advisors, Xavier Giró-i-Nieto or Jordi Torres.

APPLY NOW (Deadline for application: February 1st, 2018)


Further information:

INPhINIT Programme description

Doctoral studies at Spanish Research CoE

PhD Position Search


January 4th, 2018