Accessing Corpora with NLTK

2018-02-15

1.Built-in corpora.

For the corpora that ship with NLTK, just import them as below:

from nltk.corpus import <****>
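
As a small illustration of the import above, the sketch below uses one built-in corpus, gutenberg, and previews its file ids. It assumes the corpus data may not be installed yet (the data is fetched separately, for example with nltk.download('gutenberg')). The helper name preview_fileids is hypothetical, not part of NLTK.

```python
from nltk.corpus import gutenberg  # lazy loader: data is only read on first access


def preview_fileids(corpus, limit=3):
    """Return the first few file ids, or None if the corpus data is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return corpus.fileids()[:limit]
    except LookupError:
        # NLTK raises LookupError when the corpus files are not found locally.
        return None


preview = preview_fileids(gutenberg)
if preview is None:
    print("gutenberg is not installed; run nltk.download('gutenberg') first")
else:
    print(preview)
```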

2.User corpora.

For your own corpus, you need to load the files with an NLTK corpus reader as below:

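# for plain text
import os
from nltk.corpus import PlaintextCorpusReader

path = os.path.expanduser('~/text')  # expand '~' explicitly; the reader expects a real directory
pattern = r'.*\.txt'                 # a regular expression over file ids, not a shell glob
texts = PlaintextCorpusReader(path, pattern)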
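
To make the plain-text case concrete, here is a minimal sketch that builds a throwaway corpus in a temporary directory and loads it. The directory, file names and sample sentences are assumptions made up for this example. The file pattern is treated as a regular expression, so notes.md is left out.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a small temporary corpus so the example is self-contained
# (file names and contents are invented for illustration).
corpus_dir = tempfile.mkdtemp()
samples = {
    "first.txt": "Hello corpus world.",
    "second.txt": "Another small file.",
    "notes.md": "Not matched by the pattern.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_dir, name), "w", encoding="utf8") as handle:
        handle.write(text)

# The second argument is a regular expression matched against each file id.
texts = PlaintextCorpusReader(corpus_dir, r".*\.txt")

print(texts.fileids())                 # only the .txt files
print(list(texts.words("first.txt")))  # tokens from the default word tokenizer
```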

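# for mrg format text (bracketed parse trees)
import os
from nltk.corpus import BracketParseCorpusReader

mrg_path = os.path.expanduser('~/mrg')  # expand '~' explicitly
mrg_pattern = r'.*/wsj_.*\.mrg'         # matches wsj_*.mrg files inside subdirectories
mrgs = BracketParseCorpusReader(mrg_path, mrg_pattern)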
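
The mrg case can be sketched the same way. The example below writes one hand-made bracketed parse into a temporary directory and reads it back. The subdirectory 00, the file name wsj_0001.mrg and the toy sentence are assumptions for illustration, not real treebank data.

```python
import os
import tempfile

from nltk.corpus import BracketParseCorpusReader

# Create a tiny fake treebank: <tmp>/00/wsj_0001.mrg (invented content).
mrg_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(mrg_dir, "00"))
with open(os.path.join(mrg_dir, "00", "wsj_0001.mrg"), "w", encoding="utf8") as handle:
    handle.write("( (S (NP (DT The) (NN cat)) (VP (VBD sat))) )\n")

# The pattern requires a subdirectory, so file ids look like '00/wsj_0001.mrg'.
mrgs = BracketParseCorpusReader(mrg_dir, r".*/wsj_.*\.mrg")

tree = mrgs.parsed_sents()[0]  # an nltk.Tree for the first sentence
print(mrgs.fileids())
print(tree.label(), tree.leaves())
```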

3.API of corpus objects

API	Description
fileids()	return the list of all file ids in the corpus
fileids([categories])	return the file ids that belong to the given categories (categorized corpora only)
categories()	return the list of all categories in the corpus (categorized corpora only)
categories([fileids])	return the categories of the given file ids (categorized corpora only)
raw()	return the raw text of the whole corpus as one string
raw(fileids=[fileids])	return the raw text of the given file ids as one string
raw(categories=[categories])	return the raw text of the given categories as one string
words()	return the words as a list of strings; takes the same fileids/categories arguments as raw()
sents()	return the sentences, each one a list of words; takes the same arguments as words()
abspath(fileid)	return the absolute path of fileid
encoding(fileid)	return the encoding of fileid
open(fileid)	open the file and return a stream for reading it
root	the root path of the corpus (an attribute, not a method)
readme()	return the contents of the corpus README file
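
The calls in the table can be tried on a tiny categorized corpus. The sketch below is only an illustration under a few assumptions: the two files and their contents are invented, the category is taken from the directory name through cat_pattern, and LineTokenizer is passed as the sentence tokenizer so that sents() works without downloading any tokenizer model.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Invented sample files: the directory name doubles as the category.
root_dir = tempfile.mkdtemp()
files = {
    "news/a.txt": "Stocks rose today.\nMarkets were calm.\n",
    "fiction/b.txt": "The dragon slept.\n",
}
for fileid, text in files.items():
    target = os.path.join(root_dir, *fileid.split("/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", encoding="utf8") as handle:
        handle.write(text)

corpus = CategorizedPlaintextCorpusReader(
    root_dir,
    r".*\.txt",                      # file id pattern (regular expression)
    cat_pattern=r"(\w+)/.*",         # category = first path component
    sent_tokenizer=LineTokenizer(),  # assumption: one sentence per line
)

print(corpus.fileids())                               # all file ids
print(corpus.fileids(categories=["news"]))            # file ids in one category
print(corpus.categories())                            # all categories
print(corpus.categories(fileids=["fiction/b.txt"]))   # categories of a file
print(repr(corpus.raw(categories=["news"])))          # one string, not a list
print(list(corpus.words(fileids=["fiction/b.txt"])))  # list of word strings
print([list(s) for s in corpus.sents(categories=["news"])])  # list of word lists
print(corpus.abspath("news/a.txt"), corpus.encoding("news/a.txt"))
print(corpus.root)                                    # attribute, no parentheses

stream = corpus.open("fiction/b.txt")                 # stream over the file
opened_text = stream.read()
stream.close()
```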