Welcome to a Natural Language Processing series using the Natural Language Toolkit (NLTK) module with Python.
- A token is each "entity" that is part of whatever was split up based on rules.
- For example, each word is a token when a sentence is "tokenized" into words.
- Each sentence can also be a token if you tokenize the sentences out of a paragraph, as in the sketch below.
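A minimal sketch of both kinds of tokenization with NLTK; the sample text here is just an illustration:

```python
from nltk.tokenize import sent_tokenize, word_tokenize

# The Punkt tokenizer models may need to be fetched once via nltk.download("punkt").
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome."

# Sentence tokens: each sentence of the paragraph becomes one token.
print(sent_tokenize(example_text))

# Word tokens: each word (and punctuation mark) becomes one token.
print(word_tokenize(example_text))
```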
- Stop words are words that carry little to no meaning, so we want to remove them.
- Words like "we", "she", "is", "a", and so on; a filtering sketch follows below.
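One way to filter stop words out of a tokenized sentence, assuming the English stop word list that ships with NLTK's corpora; the example sentence is made up:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The stop word list may need nltk.download("stopwords") first.
stop_words = set(stopwords.words("english"))

example_sentence = "This is a sample sentence, showing off the stop words filtration."
words = word_tokenize(example_sentence)

# Keep only the tokens that are not in the English stop word list.
filtered_sentence = [w for w in words if w.lower() not in stop_words]
print(filtered_sentence)
```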
- The idea of stemming is a sort of normalization method.
- Many variations of words carry the same meaning, other than when tense is involved.
- The reason we stem is to shorten the lookup and normalize sentences, as in the sketch below.
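A short sketch of stemming with NLTK's PorterStemmer; the example words and sentence are invented for illustration:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# Variations of the same word reduce to a common stem.
example_words = ["python", "pythoner", "pythoning", "pythoned"]
for w in example_words:
    print(ps.stem(w))

# Stemming a whole sentence, token by token.
new_text = "It is very important to be pythonly while you are pythoning with python."
print([ps.stem(w) for w in word_tokenize(new_text)])
```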
- Stemming can often create non-existent words, whereas lemmas are actual words.
- A stem sometimes has no meaning in a dictionary, but a lemma will definitely have meaning; see the lemmatization sketch below.
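For contrast, a lemmatization sketch using NLTK's WordNetLemmatizer; the words are arbitrary examples, and the pos argument tells the lemmatizer which part of speech to assume (it defaults to noun):

```python
from nltk.stem import WordNetLemmatizer

# WordNet data may need nltk.download("wordnet") first.
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))              # cat
print(lemmatizer.lemmatize("geese"))             # goose
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run
```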
- Part-of-speech tagging means labeling the words in a sentence as nouns, adjectives, verbs, etc., as in the sketch below.
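A quick sketch of part-of-speech tagging on a made-up sentence:

```python
import nltk
from nltk.tokenize import word_tokenize

# The tagger model may need nltk.download("averaged_perceptron_tagger") first.
sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)

# Each token is paired with a Penn Treebank tag such as DT, JJ, NN, or VBZ.
print(nltk.pos_tag(tokens))
```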
- Chunking groups words into hopefully meaningful chunks.
- One of the main goals of chunking is to group words into what are known as "noun phrases."
- These are phrases of one or more words that contain a noun and maybe some descriptive words, as in the sketch below.
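A sketch of chunking noun phrases with a regular-expression grammar; the sentence and the grammar here are simple illustrations, not the only way to define a noun phrase:

```python
import nltk
from nltk.tokenize import word_tokenize

sentence = "The little yellow dog barked at the big cat."
tagged = nltk.pos_tag(word_tokenize(sentence))

# Noun phrase: an optional determiner, any number of adjectives, then a noun.
grammar = r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

# The result is a Tree whose NP subtrees are the noun-phrase chunks.
chunked = chunk_parser.parse(tagged)
print(chunked)
```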
- The idea of named entity recognition is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.
- There are two major options with NLTK's named entity recognition:
- Either recognize all named entities
- Or recognize named entities as their respective type, like people, locations, organizations, etc.; see the sketch below.
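A sketch of both options using nltk.ne_chunk on an invented sentence; binary=True lumps every named entity under a single NE label, while the default keeps the types:

```python
import nltk
from nltk.tokenize import word_tokenize

# ne_chunk relies on the "maxent_ne_chunker" and "words" resources from nltk.download().
sentence = "Mark Zuckerberg founded Facebook in California."
tagged = nltk.pos_tag(word_tokenize(sentence))

# Option 1: all named entities, labeled simply as NE.
print(nltk.ne_chunk(tagged, binary=True))

# Option 2: named entities labeled by type, such as PERSON, ORGANIZATION, or GPE.
print(nltk.ne_chunk(tagged, binary=False))
```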