Transforms raw-text files into their disambiguated Synset versions, producing a file with the following content structure:
word \t synset \t offset \t pos
for all the words that exist in WordNet (except stopwords).
A list of these new files is also produced.
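For illustration, a minimal sketch of reading one line of this tab-separated output; the helper name is hypothetical and not part of the tool:

```python
# Parse one "word \t synset \t offset \t pos" line produced by wn_manage.py.
# parse_synset_line is an illustrative helper, not part of this project.
def parse_synset_line(line):
    word, synset, offset, pos = line.rstrip("\n").split("\t")
    return {"word": word, "synset": synset, "offset": offset, "pos": pos}

record = parse_synset_line("dog\tdog.n.01\t02084071\tn")
print(record["word"], record["offset"], record["pos"])
```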
The program uses Python 3.X, WordNet 3.0 (from the nltk library), numpy, functools, and a trained word-embeddings model (binary or not).
Transforms text files into synset word files considering WSD via glosses
python3 wn_manage.py --input <input-location> --output <output-location> --recur <bool> --abase <bool> --model <model-file>
- <--input> : Input folder with .txt files or folders with .txt
- <--output>: Output folder where the files will be saved
- <--recur> : [OPTIONAL and case sensitive] True (default): Use synset-embeddings (trained on the output of this algorithm); False: Use word-embeddings (e.g. GoogleNews);
- Important: The synset embeddings use the structure word#offset#pos as keys. However, to make things more flexible, the output of this program is word\tsynset\toffset\tpos. Please refer to BSD_Parser to filter/parse the components you require;
- <--abase> : [OPTIONAL and case sensitive] True (default): Disambiguation via Base MSSA; False - Disambiguation via Dijkstra;
- <--model>: Word-Embedding (e.g. GoogleNews) OR Synset-Embedding model used. This should be in .vector format, but it can be changed to binary. Synset-Embeddings consider the following canonical format: word#offset#pos . These are the keys to look up the embeddings.
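The canonical key format above can be sketched as follows; the toy dictionary stands in for a real synset-embeddings model, and the helper name is an illustrative assumption:

```python
# Build a word#offset#pos key in the canonical synset-embedding format.
# toy_model is a stand-in for a real embeddings model keyed this way.
def synset_key(word, offset, pos):
    return "{}#{}#{}".format(word, offset, pos)

toy_model = {"dog#02084071#n": [0.1, 0.2, 0.3]}
vec = toy_model.get(synset_key("dog", "02084071", "n"))  # embedding or None
```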
All datasets, training corpora and generated models for the paper "Multi-Sense embeddings through a word sense disambiguation process" can be found at the DeepBlue repository.
[2019-05-15]
- Link for the datasets/models included.
[2019-03-29]
- Included URL for models and datasets used.
[2019-03-07]
- Moving project from personal repository and renaming it
[2019-01-31]
- CommandLine class implemented - Refactor
[2019-01-11]
- Ignore ASCII error when reading and cleaning files
[2018-12-06]
- Refinement script removed. Everything is done via command line using wn_manage.py only
- New instructions added on how to run
- General refactoring
- Enhancement: Gloss-avg-vect was being calculated regardless of whether the recurrent/refinement model (synset-embeddings) was used. It doesn't affect the result, but adds substantial computing time.
- Left some toy input files for testing
[2018-05-14]
- Several refactoring performed
- If a parsed document has no synset-tokens, that document won't be produced (the output file is discarded)
- Dijkstra with Refinement model implemented
[2018-05-09]
- Fix: If a document has only one word, we pick the Most Common Sense (MCS) to represent that word (single-word-document). Only for the normal approach (wn_manage.py - uses word2vec)
- To-Do: Implement the same thing for refinement approach (refi_manage - uses synset2vec)
[2018-04-24]
- Circular references fixed
- No need to use PYTHONPATH=.. anymore
- adjust on import modules
- Python path fixed
[2018-04-24]
- refi_manage.py included. Works in the same way as 'wn_manage', but uses the synset_vector (synset2vec) for the words in the input text
- Model used for 'refi_manage' has to be in the format of synset2vec (word#offset#pos)
- Minor errors on Dijkstra approach fixed - cost function altered to distance instead of similarity
[2018-03-30]
- Improved method for reading/cleaning text input
- Code working with multiple-folder-input (folder inside folder with input files) - files/input holds a toy example which can be executed directly
- fname_splitter for UNIX and WINDOWS structure
- Refactored in some items
[2018-03-12]
- PYTHONPATH=.. must be used in command line to avoid problems with local-import between packages
- folders with input;folder;model must be under synset_module/ to run with command-line
- If nltk-wordnet is not installed, run python3 -c "import nltk; nltk.download('wordnet')" or include it in the program
[2018-03-02] CosineDistance for Dijkstra; CosineSimilarity for Window-context; Implemented INDEX-COST look-up table based on a mapping created from the Wordnet dictionary using a Crystal program (look at Block BETA (wn_manage.py) and distance_reader.py for details)
[2018-03-01] Dijkstra implemented for disambiguating word-synset-nodes based on cosine distance (using priority queue) ;Synset disambiguation based on window-word context [-/+1]; Numpy applied to vector operations (averaging gloss-vector); CosineDistance/CosineSimilarity updated; Execution time (naive) for control issues
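The two numeric building blocks mentioned above (averaging gloss vectors with numpy and scoring with cosine similarity) can be sketched as below; the function names and toy vectors are illustrative, not the project's actual code:

```python
import numpy as np

# Average a synset's gloss word-vectors and compare the result against a
# context vector using cosine similarity (window-word context step).
def avg_vector(vectors):
    return np.mean(np.stack(vectors), axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

gloss_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # toy gloss vectors
context = np.array([1.0, 1.0])                             # toy context vector
score = cosine_similarity(avg_vector(gloss_vecs), context)  # → 1.0
```

For Dijkstra, the cost is the complementary cosine distance (1 - similarity), so closer synsets yield cheaper edges.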