The goal of this project is to create a multimodal dataset that is stored in a database and made accessible through a REST API. An iOS app uses this API to display the multimodal documents; the app can be controlled via eye tracking.
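A minimal sketch of how a client could fetch documents from the API is shown below; the base URL, endpoint path, and response fields are assumptions made for illustration, the actual routes are described in the wiki.

```python
# Illustrative client call; endpoint and fields are assumed, see the wiki
# for the documented routes.
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment

response = requests.get(f"{BASE_URL}/documents", timeout=10)
response.raise_for_status()

for doc in response.json():
    # Print a short summary of each multimodal document.
    print(doc.get("title"), "-", len(doc.get("sentences", [])), "sentences")
```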
Documents (e.g. books, Wikipedia articles, ...) can be processed to create a multimodal dataset. For this, the focus words of each sentence are identified. A focus word is a word that is both complex and depictable. Then, for every sentence with at least one focus word, an image is retrieved from the image dataset. In the best case, the image represents the focus word(s) as well as the context of the sentence. The next step is to save different versions of that image in which the focus word(s) are highlighted. A document containing at least one sentence that has at least one focus word and an image is saved to the database.
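The sketch below outlines this pipeline in Python. The word lists and helper functions (`is_focus_word`, `retrieve_image`) are illustrative placeholders only; the actual builder uses a trained complex word identifier, a visual concreteness measure, and CLIP-based image retrieval instead.

```python
# Hedged sketch of the focus-word pipeline; names and word lists are
# placeholders, not the project's actual implementation.
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder vocabularies; the real builder scores complexity and
# depictability with dedicated models instead of fixed lists.
COMPLEX_WORDS = {"photosynthesis", "glacier"}
DEPICTABLE_WORDS = {"glacier", "tree", "photosynthesis"}


def is_focus_word(word: str) -> bool:
    # A focus word is both complex and depictable.
    w = word.lower().strip(".,;:!?")
    return w in COMPLEX_WORDS and w in DEPICTABLE_WORDS


def retrieve_image(sentence: str, focus_words: List[str]) -> Optional[str]:
    # Placeholder: the real builder scores images from the image dataset
    # against the sentence and focus words and returns the best match.
    return "img_0001" if focus_words else None


@dataclass
class Sentence:
    text: str
    focus_words: List[str] = field(default_factory=list)
    image_id: Optional[str] = None


def process_document(sentences: List[str]) -> List[Sentence]:
    """Keep only sentences with at least one focus word and an image."""
    processed = []
    for text in sentences:
        focus = [w for w in text.split() if is_focus_word(w)]
        if not focus:
            continue
        image_id = retrieve_image(text, focus)
        if image_id is None:
            continue
        processed.append(Sentence(text, focus, image_id))
    # A document is only kept if at least one sentence survived.
    return processed


if __name__ == "__main__":
    doc = ["The glacier slowly carved the valley.", "It was cold."]
    for s in process_document(doc):
        print(s)
```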
A video of the app can be found in this repository.
Please have a look at the wiki for a detailed description of how to install and use the multimodal dataset builder, the API, and the frontend.
The wiki also covers the basic mechanisms and techniques. The documentation for the backend and the NLP pipeline can be found in the respective folders.
The iOS app is built using Apple's basic UI framework, UIKit. Therefore, we did not document the default mechanisms and functions, such as generating basic UI elements. Moreover, it is common in iOS development to use long signatures and variable names that describe the functions well enough, so less documentation is needed. However, the interesting part of the application is how the eye tracking works, which is done using multiple service classes. We documented them since they are not trivial. They can be found in the files located in the Multimodal Learning App/Multimodal Learning/Tools folder.
The multimodal dataset builder in particular builds on the work of others, which is referenced here:
- Idea, Pipeline and Parameters
- Simple Wikipedia Articles
- Images/Paper
- Complex Word Identifier
- Visual Concreteness and Implementation
- Image Retrieval/Paper
- Image Highlighting
For the multimodal dataset creation, we included the complex word identifier and miniCLIP in our code; that is why we want to explicitly point out how they are licensed. A short, illustrative sketch of CLIP-based sentence/image scoring can be found at the end of this section.
| Name | License | URL |
|---|---|---|
| Complex Word Identifier | Unknown | https://github.com/in2dblue/mastersThesis |
| miniCLIP | Apache 2.0 | https://github.com/HendrikStrobelt/miniClip |
The file "DEP_LICENSES.md" shows how the dependencies of the multimodal dataset builder are licensed.