Simple statistics in NLP
2018-02-14
1. Frequency distributions of words.
Most of the time, we can get a sense of a text's topic from its most frequent or least frequent words. To do that, we need the frequency distribution of its words or collocations.
We can do this with nltk as below:
from nltk import FreqDist
fdist = FreqDist(text)
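To make the idea concrete without needing a corpus, here is a minimal sketch that uses the standard library's Counter as a stand-in for FreqDist (in NLTK 3, FreqDist is built on Counter, so methods like most_common behave the same way). The sample sentence is made up for illustration.

```python
from collections import Counter

# Hypothetical sample text; in the post, `text` would be a tokenized corpus.
text = "the cat sat on the mat and the dog sat too".split()

# Count how often each token occurs (the job FreqDist does in nltk).
fdist = Counter(text)

# The two most frequent tokens, most common first.
top = fdist.most_common(2)

# Tokens that occur exactly once, in order of first appearance.
hapaxes = [w for w, n in fdist.items() if n == 1]

print(top)
print(hapaxes)
```

The frequent tokens hint at the topic, while the words that appear only once are often the more specific ones.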
2. Select words by length.
Sometimes the length of the words tells us something about a text, especially the distribution of word lengths. We can do this with nltk as below:
# select words by length (unique words longer than 7 characters)
val = set(text)
uwords = [w for w in val if len(w) > 7]
# get the distribution of word lengths
ldist = FreqDist(len(w) for w in text)
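The same two steps can be tried on a small made-up sentence. This is only a sketch: Counter from the standard library stands in for FreqDist, and the sample text and the name long_words are assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
text = "simple statistics reveal surprising information about language".split()

# Unique words longer than 7 characters, sorted for a stable order.
long_words = sorted(w for w in set(text) if len(w) > 7)

# Distribution of word lengths: length -> number of tokens of that length.
ldist = Counter(len(w) for w in text)

print(long_words)
print(sorted(ldist.items()))
```

Here ldist maps each length to a count, so ldist[6] is the number of six-letter tokens in the text.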
3. Collocations & Bigrams.
Collocations: sequences of words that occur together unusually often. Bigrams: pairs of adjacent words in a text; nltk provides a bigrams function to extract them.
We can do this with nltk as below:
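from nltk import bigrams
# get the bigrams of a list of words (bigrams returns a generator)
pairs = list(bigrams(['word0', 'word1', 'word2', 'word3']))
# print the collocations of the text (text must be an nltk.Text)
text.collocations()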
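To show what a bigram is and why repeated pairs are interesting, here is a rough sketch in plain Python. It builds adjacent pairs with zip and then simply counts the pairs that repeat. This is a simplification and not how nltk finds collocations, which scores pairs with a statistical association measure rather than a raw count. The sample sentence and variable names are made up.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
words = "new york is big and new york is busy".split()

# Bigrams: every pair of adjacent tokens.
pairs = list(zip(words, words[1:]))

# Crude stand-in for collocations: pairs that occur more than once.
pair_counts = Counter(pairs)
repeated = [p for p, n in pair_counts.items() if n > 1]

print(pairs[:3])
print(repeated)
```

On a real corpus a raw count like this is dominated by pairs of very common words, which is why an association measure is the better choice in practice.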