
Accessing web and local text

2018-02-17


1.Handling plain web text

1.1.Accessing web text

Access web text as below:

import urllib

url = 'http://www.gutenberg.org/files/2554/2554.txt'

# read the whole text file into a single string
raw = urllib.urlopen(url).read()

# the same request routed through an HTTP proxy
proxy = {'http': 'http://www.yourproxy.com:443'}
raw = urllib.urlopen(url, proxies=proxy).read()
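
A quick check on what was downloaded (just a couple of inspection calls; no output shown here):

# context same as above

# how many characters were downloaded, and a peek at the start of the string
len(raw)
raw[:75]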

1.2.Tokenizing the text

Tokenize the text (a string) to produce a list of tokens:

# context same as above
import nltk

# tokenize the text string into a list of tokens
tokens = nltk.word_tokenize(raw)

1.3.Creating nltk.Text object

Once we have created an nltk.Text object from the tokens, we can work with the text through NLTK.

# context same as above

# create an nltk.Text object from the token list
text = nltk.Text(tokens)

Then we can use the methods of the nltk.Text object, for example:

# context same as above

# find common collocations (frequent word pairings) in the text
text.collocations()
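
Other nltk.Text methods work the same way; for instance (the query word 'gun' here is just an arbitrary illustration, not taken from the original post):

# context same as above

# show every occurrence of a word together with its surrounding context
text.concordance('gun')

# list words that appear in similar contexts
text.similar('gun')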

2.Handling HTML document

2.1.Accessing HTML document

Access an HTML document as below:

import urllib
import nltk

# fetch the HTML document from the url
url = 'http://news.bbc.co.uk/2/hi/health/2284783.stm'
html_doc = urllib.urlopen(url).read()

# strip the markup to get the plain text content
raw = nltk.clean_html(html_doc)

# tokenize the text string
tokens = nltk.word_tokenize(raw)

# create an nltk.Text object
text = nltk.Text(tokens)

Note: the result still contains a lot of content that we do not need. You can clean it by hand, or use a dedicated tool such as BeautifulSoup, as in the sketch below.
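
A minimal sketch with BeautifulSoup (assuming the bs4 package is installed; html_doc is the string fetched above):

from bs4 import BeautifulSoup
import nltk

# parse the HTML and keep only the human-readable text
soup = BeautifulSoup(html_doc, 'html.parser')
raw = soup.get_text()

# tokenize the cleaned text as before
tokens = nltk.word_tokenize(raw)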

3.Handling Search Engine Results

4.Handling RSS Feeds

Access RSS feeds as below:

import feedparser
import nltk

# parse the blog feed
llog = feedparser.parse('http://languagelog.ldc.upenn.edu/nll/?feed=atom')

# title of the blog
llog['feed']['title']

# number of posts
len(llog.entries)

# get a single post
post = llog.entries[2]
post.title

# get the content of the post
content = post.content[0].value

# strip the markup and tokenize the content
tokens = nltk.word_tokenize(nltk.clean_html(content))

5.Handling Local Files

As below:

import nltk

# read the whole file into a single string
f = open('document.txt')
raw = f.read()

# read the file line by line
f = open('document.txt', 'rU')
for line in f:
    print(line.strip())

# access a text shipped with the NLTK corpora
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'rU').read()

6.Handling PDF, MS Word & Other Binary Formats

Use a third-party library, or extract the text by hand.
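
For PDF files, a rough sketch using the third-party PyPDF2 package (the file name is hypothetical, and the exact method names differ between PyPDF2 versions):

import PyPDF2
import nltk

# open the PDF in binary mode and wrap it in a reader
reader = PyPDF2.PdfFileReader(open('document.pdf', 'rb'))

# concatenate the extracted text of every page
raw = ''
for i in range(reader.getNumPages()):
    raw += reader.getPage(i).extractText()

# tokenize the extracted text as usual
tokens = nltk.word_tokenize(raw)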

7.Handling User Input

# read a line of text from the user (Python 2's raw_input; use input() on Python 3)
s = raw_input('Enter some text: ')
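
The captured string can then be handled like any other text, for example (a small follow-up, assuming NLTK is available):

# context same as above
import nltk

# tokenize the input and report how many words were typed
print('You typed %d words.' % len(nltk.word_tokenize(s)))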

8.The common NLP processing pipeline

As below:

nlp_pipeline.png
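
A rough sketch that strings the earlier steps together into one pipeline (same URL and older-style calls as above; newer Python/NLTK versions would use urllib.request and BeautifulSoup instead of urllib.urlopen and nltk.clean_html):

import urllib
import nltk

# 1. fetch the raw HTML of a page
url = 'http://news.bbc.co.uk/2/hi/health/2284783.stm'
html_doc = urllib.urlopen(url).read()

# 2. strip the markup to get plain text
raw = nltk.clean_html(html_doc)

# 3. split the plain text into a list of tokens
tokens = nltk.word_tokenize(raw)

# 4. wrap the tokens in an nltk.Text object for further analysis
text = nltk.Text(tokens)

# 5. derive a sorted vocabulary from the tokens
vocab = sorted(set(tokens))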