API Reference¶
Blob Classes¶
Wrappers for various units of text, including the main TextBlob, Word, and WordList classes.
Example usage:
>>> from textblob import TextBlob
>>> b = TextBlob("Simple is better than complex.")
>>> b.tags
[(u'Simple', u'NN'), (u'is', u'VBZ'), (u'better', u'JJR'), (u'than', u'IN'), (u'complex', u'NN')]
>>> b.noun_phrases
WordList([u'simple'])
>>> b.words
WordList([u'Simple', u'is', u'better', u'than', u'complex'])
>>> b.sentiment
(0.06666666666666667, 0.41904761904761906)
>>> b.words[0].synsets()[0]
Synset('simple.n.01')
Changed in version 0.8.0: These classes are now imported from textblob rather than text.blob.
-
class textblob.blob.BaseBlob(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]¶ An abstract base class that all textblob classes will inherit from. Includes words, POS tag, noun phrase, and word count properties. Also includes basic dunder and string methods for making objects behave like Python strings.
Parameters:
- text – A string.
- tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
- np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
- pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
- analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
- parser – (optional) A parser. If None, defaults to PatternParser.
- classifier – (optional) A classifier.
Changed in version 0.6.0: The clean_html parameter is deprecated, as it is in NLTK.
-
correct()[source]¶ Attempt to correct the spelling of a blob.
New in version 0.6.0.
Return type: BaseBlob
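A usage sketch, taken from the spelling-correction example in the library's quickstart guide:
>>> from textblob import TextBlob
>>> print(TextBlob("I havv goood speling!").correct())
I have good spelling!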
-
detect_language
()[source]¶ Detect the blob’s language using the Google Translate API.
Requires an internet connection.
Usage:
>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Return type: str
-
ends_with
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
endswith
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
find
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].
-
format
(*args, **kwargs)¶ Perform a string formatting operation, like the built-in
str.format(*args, **kwargs)
. Returns a blob object.
-
index
(sub, start=0, end=9223372036854775807)¶ Like blob.find() but raise ValueError when the substring is not found.
-
join
(iterable)¶ Behaves like the built-in str.join(iterable) method, except returns a blob object.
Returns a blob which is the concatenation of the strings or blobs in the iterable.
-
lower
()¶ Like str.lower(), returns new object with all lower-cased characters.
-
ngrams
(n=3)[source]¶ Return a list of n-grams (tuples of n successive words) for this blob.
Return type: List of WordLists
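For illustration, bigrams over the sentence used at the top of this page (output reconstructed from the documented behavior; each n-gram is a WordList of n successive words):
>>> TextBlob("Simple is better than complex").ngrams(n=2)
[WordList(['Simple', 'is']), WordList(['is', 'better']), WordList(['better', 'than']), WordList(['than', 'complex'])]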
-
noun_phrases
¶ Returns a list of noun phrases for this blob.
-
np_counts
¶ Dictionary of noun phrase frequencies in this text.
-
parse
(parser=None)[source]¶ Parse the text.
Parameters: parser – (optional) A parser instance. If None, defaults to this blob's default parser.
New in version 0.6.0.
-
polarity
¶ Return the polarity score as a float within the range [-1.0, 1.0]
Return type: float
-
pos_tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
replace
(old, new, count=9223372036854775807)¶ Return a new blob object with all occurrences of old replaced by new.
-
rfind
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].
-
rindex
(sub, start=0, end=9223372036854775807)¶ Like blob.rfind() but raise ValueError when substring is not found.
-
sentiment
¶ Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
Return type: namedtuple of the form Sentiment(polarity, subjectivity)
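Because the result is a namedtuple, the fields can be accessed by name. Using the example blob from the top of this page:
>>> b = TextBlob("Simple is better than complex.")
>>> b.sentiment.polarity
0.06666666666666667
>>> b.sentiment.subjectivity
0.41904761904761906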
-
sentiment_assessments
¶ Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.
Return type: namedtuple of the form Sentiment(polarity, subjectivity, assessments)
-
split
(sep=None, maxsplit=9223372036854775807)[source]¶ Behaves like the built-in str.split() except returns a WordList.
Return type: WordList
-
starts_with
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
startswith
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
strip
(chars=None)¶ Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.
-
subjectivity
¶ Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
Return type: float
-
tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
title
()¶ Returns a blob object with the text in title-case.
-
tokenize
(tokenizer=None)[source]¶ Return a list of tokens, using tokenizer.
Parameters: tokenizer – (optional) A tokenizer object. If None, defaults to this blob's default tokenizer.
-
tokens
¶ Return a list of tokens, using this blob’s tokenizer object (defaults to
WordTokenizer
).
-
translate
(from_lang=u'auto', to=u'en')[source]¶ Translate the blob to another language. Uses the Google Translate API. Returns a new TextBlob.
Requires an internet connection.
Usage:
>>> b = TextBlob("Simple is better than complex")
>>> b.translate(to="es")
TextBlob('Lo simple es mejor que complejo')
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Parameters:
- from_lang (str) – Language to translate from. If None, will attempt to detect the language.
- to (str) – Language to translate to.
Return type: TextBlob
-
upper
()¶ Like str.upper(), returns new object with all upper-cased characters.
-
word_counts
¶ Dictionary of word frequencies in this text.
-
class textblob.blob.Blobber(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]¶ A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.
Usage:
>>> from textblob import Blobber
>>> from textblob.taggers import NLTKTagger
>>> from textblob.tokenizers import SentenceTokenizer
>>> tb = Blobber(pos_tagger=NLTKTagger(), tokenizer=SentenceTokenizer())
>>> blob1 = tb("This is one blob.")
>>> blob2 = tb("This blob has the same tagger and tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True
Parameters:
- tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
- np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
- pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
- analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
- parser – (optional) A parser. If None, defaults to PatternParser.
- classifier – (optional) A classifier.
New in version 0.4.0.
-
class textblob.blob.Sentence(sentence, start_index=0, end_index=None, *args, **kwargs)[source]¶ A sentence within a TextBlob. Inherits from BaseBlob.
Parameters:
- sentence – A string, the raw sentence.
- start_index – An int, the index where this sentence begins in a TextBlob. If not given, defaults to 0.
- end_index – An int, the index where this sentence ends in a TextBlob. If not given, defaults to the length of the sentence - 1.
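Sentence objects are normally obtained by splitting a TextBlob rather than constructed directly; each one carries its start_index and end_index within the parent blob. A short sketch based on the quickstart examples:
>>> from textblob import TextBlob
>>> zen = TextBlob("Beautiful is better than ugly. Explicit is better than implicit.")
>>> zen.sentences
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit.")]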
-
classify
()¶ Classify the blob using the blob’s
classifier
.
-
detect_language
()¶ Detect the blob’s language using the Google Translate API.
Requires an internet connection.
Usage:
>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Return type: str
-
dict
¶ The dict representation of this sentence.
-
end
= None¶ The end index within a TextBlob
-
end_index
= None¶ The end index within a TextBlob
-
ends_with
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
endswith
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
find
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].
-
format
(*args, **kwargs)¶ Perform a string formatting operation, like the built-in
str.format(*args, **kwargs)
. Returns a blob object.
-
index
(sub, start=0, end=9223372036854775807)¶ Like blob.find() but raise ValueError when the substring is not found.
-
join
(iterable)¶ Behaves like the built-in str.join(iterable) method, except returns a blob object.
Returns a blob which is the concatenation of the strings or blobs in the iterable.
-
lower
()¶ Like str.lower(), returns new object with all lower-cased characters.
-
ngrams
(n=3)¶ Return a list of n-grams (tuples of n successive words) for this blob.
Return type: List of WordLists
-
noun_phrases
¶ Returns a list of noun phrases for this blob.
-
np_counts
¶ Dictionary of noun phrase frequencies in this text.
-
parse
(parser=None)¶ Parse the text.
Parameters: parser – (optional) A parser instance. If None, defaults to this blob's default parser.
New in version 0.6.0.
-
polarity
¶ Return the polarity score as a float within the range [-1.0, 1.0]
Return type: float
-
pos_tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
replace
(old, new, count=9223372036854775807)¶ Return a new blob object with all occurrences of old replaced by new.
-
rfind
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].
-
rindex
(sub, start=0, end=9223372036854775807)¶ Like blob.rfind() but raise ValueError when substring is not found.
-
sentiment
¶ Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
Return type: namedtuple of the form Sentiment(polarity, subjectivity)
-
sentiment_assessments
¶ Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.
Return type: namedtuple of the form Sentiment(polarity, subjectivity, assessments)
-
split
(sep=None, maxsplit=9223372036854775807)¶ Behaves like the built-in str.split() except returns a WordList.
Return type: WordList
-
start
= None¶ The start index within a TextBlob
-
start_index
= None¶ The start index within a TextBlob
-
starts_with
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
startswith
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
strip
(chars=None)¶ Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.
-
subjectivity
¶ Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
Return type: float
-
tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
title
()¶ Returns a blob object with the text in title-case.
-
tokenize
(tokenizer=None)¶ Return a list of tokens, using tokenizer.
Parameters: tokenizer – (optional) A tokenizer object. If None, defaults to this blob's default tokenizer.
-
tokens
¶ Return a list of tokens, using this blob’s tokenizer object (defaults to
WordTokenizer
).
-
translate
(from_lang=u'auto', to=u'en')¶ Translate the blob to another language. Uses the Google Translate API. Returns a new TextBlob.
Requires an internet connection.
Usage:
>>> b = TextBlob("Simple is better than complex")
>>> b.translate(to="es")
TextBlob('Lo simple es mejor que complejo')
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Parameters:
- from_lang (str) – Language to translate from. If None, will attempt to detect the language.
- to (str) – Language to translate to.
Return type: TextBlob
-
upper
()¶ Like str.upper(), returns new object with all upper-cased characters.
-
word_counts
¶ Dictionary of word frequencies in this text.
-
class textblob.blob.TextBlob(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]¶ A general text block, meant for larger bodies of text (esp. those containing sentences). Inherits from BaseBlob.
Parameters:
- text (str) – A string.
- tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
- np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
- pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
- analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
- classifier – (optional) A classifier.
-
classify
()¶ Classify the blob using the blob’s
classifier
.
-
detect_language
()¶ Detect the blob’s language using the Google Translate API.
Requires an internet connection.
Usage:
>>> b = TextBlob("bonjour")
>>> b.detect_language()
u'fr'
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Return type: str
-
ends_with
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
endswith
(suffix, start=0, end=9223372036854775807)¶ Returns True if the blob ends with the given suffix.
-
find
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].
-
format
(*args, **kwargs)¶ Perform a string formatting operation, like the built-in
str.format(*args, **kwargs)
. Returns a blob object.
-
index
(sub, start=0, end=9223372036854775807)¶ Like blob.find() but raise ValueError when the substring is not found.
-
join
(iterable)¶ Behaves like the built-in str.join(iterable) method, except returns a blob object.
Returns a blob which is the concatenation of the strings or blobs in the iterable.
-
json
¶ The json representation of this blob.
Changed in version 0.5.1: Made
json
a property instead of a method to restore backwards compatibility that was broken after version 0.4.0.
-
lower
()¶ Like str.lower(), returns new object with all lower-cased characters.
-
ngrams
(n=3)¶ Return a list of n-grams (tuples of n successive words) for this blob.
Return type: List of WordLists
-
noun_phrases
¶ Returns a list of noun phrases for this blob.
-
np_counts
¶ Dictionary of noun phrase frequencies in this text.
-
parse
(parser=None)¶ Parse the text.
Parameters: parser – (optional) A parser instance. If None, defaults to this blob's default parser.
New in version 0.6.0.
-
polarity
¶ Return the polarity score as a float within the range [-1.0, 1.0]
Return type: float
-
pos_tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
raw_sentences
¶ List of strings, the raw sentences in the blob.
-
replace
(old, new, count=9223372036854775807)¶ Return a new blob object with all occurrences of old replaced by new.
-
rfind
(sub, start=0, end=9223372036854775807)¶ Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].
-
rindex
(sub, start=0, end=9223372036854775807)¶ Like blob.rfind() but raise ValueError when substring is not found.
-
sentiment
¶ Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
Return type: namedtuple of the form Sentiment(polarity, subjectivity)
-
sentiment_assessments
¶ Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.
Return type: namedtuple of the form Sentiment(polarity, subjectivity, assessments)
-
serialized
¶ Returns a list of each sentence’s dict representation.
-
split
(sep=None, maxsplit=9223372036854775807)¶ Behaves like the built-in str.split() except returns a WordList.
Return type: WordList
-
starts_with
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
startswith
(prefix, start=0, end=9223372036854775807)¶ Returns True if the blob starts with the given prefix.
-
strip
(chars=None)¶ Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.
-
subjectivity
¶ Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
Return type: float
-
tags¶ Returns a list of tuples of the form (word, POS tag).
Example:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
Return type: list of tuples
-
title
()¶ Returns a blob object with the text in title-case.
-
to_json
(*args, **kwargs)[source]¶ Return a json representation (str) of this blob. Takes the same arguments as json.dumps.
New in version 0.5.1.
-
tokenize
(tokenizer=None)¶ Return a list of tokens, using tokenizer.
Parameters: tokenizer – (optional) A tokenizer object. If None, defaults to this blob's default tokenizer.
-
tokens
¶ Return a list of tokens, using this blob’s tokenizer object (defaults to
WordTokenizer
).
-
translate
(from_lang=u'auto', to=u'en')¶ Translate the blob to another language. Uses the Google Translate API. Returns a new TextBlob.
Requires an internet connection.
Usage:
>>> b = TextBlob("Simple is better than complex")
>>> b.translate(to="es")
TextBlob('Lo simple es mejor que complejo')
Language code reference: https://developers.google.com/translate/v2/using_rest#language-params
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
Parameters:
- from_lang (str) – Language to translate from. If None, will attempt to detect the language.
- to (str) – Language to translate to.
Return type: TextBlob
-
upper
()¶ Like str.upper(), returns new object with all upper-cased characters.
-
word_counts
¶ Dictionary of word frequencies in this text.
-
class textblob.blob.Word(string, pos_tag=None)[source]¶ A simple word representation. Includes methods for inflection, translation, and WordNet integration.
-
capitalize
() → unicode¶ Return a capitalized version of S, i.e. make the first character have upper case and the rest lower case.
-
center
(width[, fillchar]) → unicode¶ Return S centered in a Unicode string of length width. Padding is done using the specified fill character (default is a space)
-
correct
()[source]¶ Correct the spelling of the word. Returns the word with the highest confidence using the spelling corrector.
New in version 0.6.0.
-
count
(sub[, start[, end]]) → int¶ Return the number of non-overlapping occurrences of substring sub in Unicode string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
-
decode
([encoding[, errors]]) → string or unicode¶ Decodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeDecodeError. Other possible values are ‘ignore’ and ‘replace’ as well as any other name registered with codecs.register_error that is able to handle UnicodeDecodeErrors.
-
define
(pos=None)[source]¶ Return a list of definitions for this word. Each definition corresponds to a synset for this word.
Parameters: pos – A part-of-speech tag to filter upon. If None, definitions for all parts of speech will be loaded.
Return type: List of strings
New in version 0.7.0.
-
definitions
¶ The list of definitions for this word. Each definition corresponds to a synset.
New in version 0.7.0.
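A minimal sketch from the quickstart guide (the result depends on the WordNet data installed via python -m textblob.download_corpora):
>>> from textblob import Word
>>> Word("octopus").definitions[0]
'tentacles of octopus prepared as food'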
-
detect_language
()[source]¶ Detect the word’s language using Google’s Translate API.
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
-
encode
([encoding[, errors]]) → string or unicode¶ Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
-
endswith
(suffix[, start[, end]]) → bool¶ Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.
-
expandtabs
([tabsize]) → unicode¶ Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 characters is assumed.
-
find
(sub[, start[, end]]) → int¶ Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
format
(*args, **kwargs) → unicode¶ Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
get_synsets
(pos=None)[source]¶ Return a list of Synset objects for this word.
Parameters: pos – A part-of-speech tag to filter upon. If None, all synsets for all parts of speech will be loaded.
Return type: list of Synsets
New in version 0.7.0.
-
index
(sub[, start[, end]]) → int¶ Like S.find() but raise ValueError when the substring is not found.
-
isalnum
() → bool¶ Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.
-
isalpha
() → bool¶ Return True if all characters in S are alphabetic and there is at least one character in S, False otherwise.
-
isdecimal
() → bool¶ Return True if there are only decimal characters in S, False otherwise.
-
isdigit
() → bool¶ Return True if all characters in S are digits and there is at least one character in S, False otherwise.
-
islower
() → bool¶ Return True if all cased characters in S are lowercase and there is at least one cased character in S, False otherwise.
-
isnumeric
() → bool¶ Return True if there are only numeric characters in S, False otherwise.
-
isspace
() → bool¶ Return True if all characters in S are whitespace and there is at least one character in S, False otherwise.
-
istitle
() → bool¶ Return True if S is a titlecased string and there is at least one character in S, i.e. upper- and titlecase characters may only follow uncased characters and lowercase characters only cased ones. Return False otherwise.
-
isupper
() → bool¶ Return True if all cased characters in S are uppercase and there is at least one cased character in S, False otherwise.
-
join
(iterable) → unicode¶ Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.
-
lemma
¶ Return the lemma of this word using Wordnet’s morphy function.
-
lemmatize
(**kwargs)[source]¶ Return the lemma for a word using WordNet’s morphy function.
Parameters: pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.
New in version 0.8.1.
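Two examples taken from the library's quickstart guide (the second passes a WordNet POS tag, "v" for verb):
>>> from textblob import Word
>>> Word("octopi").lemmatize()
'octopus'
>>> Word("went").lemmatize("v")
'go'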
-
ljust
(width[, fillchar]) → int¶ Return S left-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).
-
lower
() → unicode¶ Return a copy of the string S converted to lowercase.
-
lstrip
([chars]) → unicode¶ Return a copy of the string S with leading whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping
-
partition
(sep) -> (head, sep, tail)¶ Search for the separator sep in S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return S and two empty strings.
-
replace
(old, new[, count]) → unicode¶ Return a copy of S with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
-
rfind
(sub[, start[, end]]) → int¶ Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
rindex
(sub[, start[, end]]) → int¶ Like S.rfind() but raise ValueError when the substring is not found.
-
rjust
(width[, fillchar]) → unicode¶ Return S right-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).
-
rpartition
(sep) -> (head, sep, tail)¶ Search for the separator sep in S, starting at the end of S, and return the part before it, the separator itself, and the part after it. If the separator is not found, return two empty strings and S.
-
rsplit
([sep[, maxsplit]]) → list of strings¶ Return a list of the words in S, using sep as the delimiter string, starting at the end of the string and working to the front. If maxsplit is given, at most maxsplit splits are done. If sep is not specified, any whitespace string is a separator.
-
rstrip
([chars]) → unicode¶ Return a copy of the string S with trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping
-
spellcheck
()[source]¶ Return a list of (word, confidence) tuples of spelling corrections.
Based on: Peter Norvig, “How to Write a Spelling Corrector” (http://norvig.com/spell-correct.html) as implemented in the pattern library.
New in version 0.6.0.
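An example from the quickstart guide; the confidence is the corrector's estimate that the suggestion is correct:
>>> from textblob import Word
>>> Word("falibility").spellcheck()
[('fallibility', 1.0)]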
-
split
([sep[, maxsplit]]) → list of strings¶ Return a list of the words in S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.
-
splitlines
(keepends=False) → list of strings¶ Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.
-
startswith
(prefix[, start[, end]]) → bool¶ Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.
-
stem
(stemmer=<PorterStemmer>)[source]¶ Stem a word using various NLTK stemmers. (Default: Porter Stemmer)
New in version 0.12.0.
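A small sketch of the default Porter stemmer (any NLTK stemmer instance can be passed via the stemmer parameter instead):
>>> from textblob import Word
>>> Word("running").stem()
'run'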
-
strip
([chars]) → unicode¶ Return a copy of the string S with leading and trailing whitespace removed. If chars is given and not None, remove characters in chars instead. If chars is a str, it will be converted to unicode before stripping
-
swapcase
() → unicode¶ Return a copy of S with uppercase characters converted to lowercase and vice versa.
-
synsets
¶ The list of Synset objects for this Word.
Return type: list of Synsets
New in version 0.7.0.
-
title
() → unicode¶ Return a titlecased version of S, i.e. words start with title case characters, all remaining cased characters have lower case.
-
translate
(from_lang=u'auto', to=u'en')[source]¶ Translate the word to another language using Google’s Translate API.
Deprecated since version 0.16.0: Use the official Google Translate API instead.
New in version 0.5.0.
-
upper
() → unicode¶ Return a copy of S converted to uppercase.
-
zfill
(width) → unicode¶ Pad a numeric string S with zeros on the left, to fill a field of the specified width. The string S is never truncated.
-
-
class textblob.blob.WordList(collection)[source]¶ A list-like collection of words.
-
count
(strg, case_sensitive=False, *args, **kwargs)[source]¶ Get the count of a word or phrase strg within this WordList. An example follows the parameter list.
Parameters: - strg – The string to count.
- case_sensitive – A boolean, whether or not the search is case-sensitive.
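Adapted from the quickstart guide, showing the case_sensitive flag:
>>> from textblob import TextBlob
>>> monty = TextBlob("We are no longer the Knights who say Ni. "
...                  "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.words.count('ekki')
3
>>> monty.words.count('ekki', case_sensitive=True)
2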
-
extend
(iterable)[source]¶ Extend WordList by appending elements from iterable. If an element is a string, appends a Word object.
-
index
(value[, start[, stop]]) → integer -- return first index of value.¶ Raises ValueError if the value is not present.
-
insert
()¶ L.insert(index, object) – insert object before index
-
pop
([index]) → item -- remove and return item at index (default last).¶ Raises IndexError if list is empty or index is out of range.
-
remove
()¶ L.remove(value) – remove first occurrence of value. Raises ValueError if the value is not present.
-
reverse
()¶ L.reverse() – reverse IN PLACE
-
sort
()¶ L.sort(cmp=None, key=None, reverse=False) – stable sort IN PLACE; cmp(x, y) -> -1, 0, 1
-
Base Classes¶
Abstract base classes for models (taggers, noun phrase extractors, etc.) which define the interface for descendant classes.
Changed in version 0.7.0: All base classes are defined in the same module, textblob.base.
-
class textblob.base.BaseNPExtractor[source]¶ Abstract base class from which all NPExtractor classes inherit. Descendant classes must implement an extract(text) method that returns a list of noun phrases as strings.
-
class
textblob.base.
BaseParser
[source]¶ Abstract parser class from which all parsers inherit from. All descendants must implement a
parse()
method.
-
class
textblob.base.
BaseSentimentAnalyzer
[source]¶ Abstract base class from which all sentiment analyzers inherit. Should implement an
analyze(text)
method which returns either the results of analysis.
-
class
textblob.base.
BaseTagger
[source]¶ Abstract tagger class from which all taggers inherit from. All descendants must implement a
tag()
method.
-
class
textblob.base.
BaseTokenizer
[source]¶ Abstract base class from which all Tokenizer classes inherit. Descendant classes must implement a
tokenize(text)
method that returns a list of noun phrases as strings.
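A minimal sketch of the descendant-class contract; the BlankLineTokenizer name and its splitting rule are illustrative assumptions, not part of the library:
from textblob.base import BaseTokenizer

class BlankLineTokenizer(BaseTokenizer):
    """Hypothetical tokenizer that splits text on blank lines."""
    def tokenize(self, text):
        # Return a list of tokens (strings), as the interface requires.
        return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]

An instance of such a class can then be passed to TextBlob via the tokenizer parameter documented above.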
Tokenizers¶
Various tokenizer implementations.
New in version 0.4.0.
-
class textblob.tokenizers.SentenceTokenizer[source]¶ NLTK's sentence tokenizer (currently PunktSentenceTokenizer). Uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences, then uses that to find sentence boundaries.
-
itokenize
(text, *args, **kwargs)¶ Return a generator that generates tokens “on-demand”.
New in version 0.6.0.
Return type: generator
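A brief sketch of the lazy behavior (tokens are produced on demand rather than collected into a list up front):
>>> from textblob.tokenizers import SentenceTokenizer
>>> gen = SentenceTokenizer().itokenize("Hello world. Goodbye world.")
>>> next(gen)
'Hello world.'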
-
-
class
textblob.tokenizers.
WordTokenizer
[source]¶ NLTK’s recommended word tokenizer (currently the TreeBankTokenizer). Uses regular expressions to tokenize text. Assumes text has already been segmented into sentences.
Performs the following steps:
- split standard contractions, e.g. don’t -> do n’t
- split commas and single quotes
- separate periods that appear at the end of a line (see the sketch below)
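A sketch of the contraction-splitting step described above, with standard Treebank-style output:
>>> from textblob.tokenizers import WordTokenizer
>>> WordTokenizer().tokenize("They don't stop.")
['They', 'do', "n't", 'stop', '.']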
-
itokenize
(text, *args, **kwargs)¶ Return a generator that generates tokens “on-demand”.
New in version 0.6.0.
Return type: generator
-
textblob.tokenizers.sent_tokenize = <bound method SentenceTokenizer.itokenize of <textblob.tokenizers.SentenceTokenizer object>>¶ Convenience function for tokenizing sentences.
POS Taggers¶
Parts-of-speech tagger implementations.
-
class textblob.en.taggers.NLTKTagger[source]¶ Tagger that uses NLTK's standard TreeBank tagger. NOTE: Requires numpy. Not yet supported with PyPy.
-
class textblob.en.taggers.PatternTagger[source]¶ Tagger that uses the implementation in Tom de Smedt's pattern library (http://www.clips.ua.ac.be/pattern).
Noun Phrase Extractors¶
Various noun phrase extractors.
-
class textblob.en.np_extractors.ChunkParser[source]¶
-
evaluate(gold)¶ Score the accuracy of the chunker against the gold standard. Remove the chunking from the gold standard text, rechunk it using the chunker, and return a ChunkScore object reflecting the performance of this chunk parser.
Parameters: gold (list(Tree)) – The list of chunked sentences to score the chunker on.
Return type: ChunkScore
-
grammar
()¶ Returns: The grammar used by this parser.
-
parse_all
(sent, *args, **kwargs)¶ Return type: list(Tree)
-
parse_one
(sent, *args, **kwargs)¶ Return type: Tree or None
-
parse_sents
(sents, *args, **kwargs)¶ Apply self.parse() to each element of sents.
Return type: iter(iter(Tree))
-
Sentiment Analyzers¶
Sentiment analysis implementations.
New in version 0.5.0.
-
class textblob.en.sentiments.NaiveBayesAnalyzer(feature_extractor=<function _default_feature_extractor>)[source]¶ Naive Bayes analyzer that is trained on a dataset of movie reviews. Returns results as a namedtuple of the form: Sentiment(classification, p_pos, p_neg)
Parameters: feature_extractor (callable) – Function that returns a dictionary of features, given a list of words.
-
RETURN_TYPE
¶ Return type declaration
alias of
Sentiment
-
-
class textblob.en.sentiments.PatternAnalyzer[source]¶ Sentiment analyzer that uses the same implementation as the pattern library. Returns results as a namedtuple of the form: Sentiment(polarity, subjectivity, [assessments]), where [assessments] is a list of the assessed tokens and their polarity and subjectivity scores.
-
RETURN_TYPE
¶ alias of
Sentiment
-
Parsers¶
Various parser implementations.
New in version 0.6.0.
-
class textblob.en.parsers.PatternParser[source]¶ Parser that uses the implementation in Tom de Smedt's pattern library: http://www.clips.ua.ac.be/pages/pattern-en#parser
Classifiers¶
Various classifier implementations. Also includes basic feature extractor methods.
Example Usage:
>>> from textblob import TextBlob
>>> from textblob.classifiers import NaiveBayesClassifier
>>> train = [
... ('I love this sandwich.', 'pos'),
... ('This is an amazing place!', 'pos'),
... ('I feel very good about these beers.', 'pos'),
... ('I do not like this restaurant', 'neg'),
... ('I am tired of this stuff.', 'neg'),
... ("I can't deal with this", 'neg'),
... ("My boss is horrible.", "neg")
... ]
>>> cl = NaiveBayesClassifier(train)
>>> cl.classify("I feel amazing!")
'pos'
>>> blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
>>> for s in blob.sentences:
... print(s)
... print(s.classify())
...
The beer is good.
pos
But the hangover is horrible.
neg
New in version 0.6.0.
-
class textblob.classifiers.BaseClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶ Abstract classifier class from which all classifiers inherit. At a minimum, descendant classes must implement a classify method and have a classifier property.
Parameters:
- train_set – The training set, either a list of tuples of the form (text, classification) or a file-like object. text may be either a string or an iterable.
- feature_extractor (callable) – A feature extractor function that takes one or two arguments: document and train_set.
- format (str) – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
- kwargs – Additional keyword arguments are passed to the constructor of the Format class used to read the data. Only applies when a file-like object is passed as train_set.
New in version 0.6.0.
-
classifier
¶ The classifier object.
-
class textblob.classifiers.DecisionTreeClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶ A classifier based on the decision tree algorithm, as implemented in NLTK.
Parameters:
- train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
- feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
- format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
New in version 0.6.2.
-
accuracy
(test_set, format=None)¶ Compute the accuracy on a test set.
Parameters:
- test_set – A list of tuples of the form (text, label), or a file pointer.
- format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
-
classifier
¶ The classifier.
-
classify
(text)¶ Classifies the text.
Parameters: text (str) – A string of text.
-
extract_features
(text)¶ Extracts features from a body of text.
Return type: dictionary of features
-
labels
()¶ Return an iterable of possible labels.
-
nltk_class
¶ alias of
nltk.classify.decisiontree.DecisionTreeClassifier
-
pprint
(*args, **kwargs)¶ Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.
Return type: str
-
pretty_format
(*args, **kwargs)[source]¶ Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.
Return type: str
-
pseudocode
(*args, **kwargs)[source]¶ Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.
Return type: str
-
train
(*args, **kwargs)¶ Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling the classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.
New in version 0.6.2.
Return type: A classifier
-
update
(new_data, *args, **kwargs)¶ Update the classifier with new training data and re-train the classifier.
Parameters: new_data – New data as a list of tuples of the form (text, label).
-
class textblob.classifiers.MaxEntClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶ A maximum entropy classifier (also known as a "conditional exponential classifier"). This classifier is parameterized by a set of "weights", which are used to combine the joint-features that are generated from a featureset by an "encoding". In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                             dotprod(weights, encode(fs, label))
prob(fs | label) = ----------------------------------------------------
                   sum(dotprod(weights, encode(fs, l)) for l in labels)

where dotprod is the dot product: dotprod(a, b) = sum(x * y for (x, y) in zip(a, b))
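A toy Python sketch of the normalization above. The encode function, weights, and labels here are hypothetical stand-ins, not the NLTK internals:
# Hypothetical: encode(fs, label) returns a numeric feature vector for the pair.
def prob(fs, label, weights, encode, labels):
    dotprod = lambda a, b: sum(x * y for (x, y) in zip(a, b))
    score = dotprod(weights, encode(fs, label))
    # Normalize by the scores of all candidate labels, as in the equation above.
    total = sum(dotprod(weights, encode(fs, l)) for l in labels)
    return score / total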
-
accuracy
(test_set, format=None)¶ Compute the accuracy on a test set.
Parameters:
- test_set – A list of tuples of the form (text, label), or a file pointer.
- format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
-
classifier
¶ The classifier.
-
classify
(text)¶ Classifies the text.
Parameters: text (str) – A string of text.
-
extract_features
(text)¶ Extracts features from a body of text.
Return type: dictionary of features
-
labels
()¶ Return an iterable of possible labels.
-
nltk_class
¶ alias of
nltk.classify.maxent.MaxentClassifier
-
prob_classify
(text)[source]¶ Return the label probability distribution for classifying a string of text.
Example:
>>> classifier = MaxEntClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7
Return type: nltk.probability.DictionaryProbDist
-
train
(*args, **kwargs)¶ Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling the classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.
New in version 0.6.2.
Return type: A classifier
-
update
(new_data, *args, **kwargs)¶ Update the classifier with new training data and re-train the classifier.
Parameters: new_data – New data as a list of tuples of the form (text, label).
-
-
class textblob.classifiers.NLTKClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶ An abstract class that wraps around the nltk.classify module.
Expects that descendant classes include a class variable nltk_class, which is the class in the nltk.classify module to be wrapped.
Example:
class MyClassifier(NLTKClassifier):
    nltk_class = nltk.classify.svm.SvmClassifier
-
accuracy
(test_set, format=None)[source]¶ Compute the accuracy on a test set.
Parameters:
- test_set – A list of tuples of the form (text, label), or a file pointer.
- format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
-
classifier
¶ The classifier.
-
extract_features
(text)¶ Extracts features from a body of text.
Return type: dictionary of features
-
nltk_class
= None¶ The NLTK class to be wrapped. Must be a class within nltk.classify
-
train
(*args, **kwargs)[source]¶ Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling the classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.
New in version 0.6.2.
Return type: A classifier
-
-
class textblob.classifiers.NaiveBayesClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶ A classifier based on the Naive Bayes algorithm, as implemented in NLTK.
Parameters:
- train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
- feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
- format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
New in version 0.6.0.
-
accuracy
(test_set, format=None)¶ Compute the accuracy on a test set.
Parameters:
- test_set – A list of tuples of the form (text, label), or a file pointer.
- format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
-
classifier
¶ The classifier.
-
classify
(text)¶ Classifies the text.
Parameters: text (str) – A string of text.
-
extract_features
(text)¶ Extracts features from a body of text.
Return type: dictionary of features
-
informative_features
(*args, **kwargs)[source]¶ Return the most informative features as a list of tuples of the form (feature_name, feature_value).
Return type: list
-
labels
()¶ Return an iterable of possible labels.
-
nltk_class
¶ alias of
nltk.classify.naivebayes.NaiveBayesClassifier
-
prob_classify
(text)[source]¶ Return the label probability distribution for classifying a string of text.
Example:
>>> classifier = NaiveBayesClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7
Return type: nltk.probability.DictionaryProbDist
-
show_informative_features
(*args, **kwargs)[source]¶ Displays a listing of the most informative features for this classifier.
Return type: None
-
train
(*args, **kwargs)¶ Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling the classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.
New in version 0.6.2.
Return type: A classifier
-
update
(new_data, *args, **kwargs)¶ Update the classifier with new training data and re-train the classifier.
Parameters: new_data – New data as a list of tuples of the form (text, label).
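A hedged usage sketch, reusing the cl classifier trained in the section example above:
>>> new_data = [("She is my best friend.", 'pos'),
...             ("I'm full of hatred.", 'neg')]
>>> cl.update(new_data)
True
>>> cl.classify("I'm full of hatred.")
'neg'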
-
class textblob.classifiers.PositiveNaiveBayesClassifier(positive_set, unlabeled_set, feature_extractor=<function contains_extractor>, positive_prob_prior=0.5, **kwargs)[source]¶ A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets, i.e. when only one class is labeled and the other is not. Assuming a prior distribution on the two labels, uses the unlabeled set to estimate the frequencies of the features.
Example usage:
>>> from textblob.classifiers import PositiveNaiveBayesClassifier
>>> sports_sentences = ['The team dominated the game',
...                     'They lost the ball',
...                     'The game was intense',
...                     'The goalkeeper catched the ball',
...                     'The other team controlled the ball']
>>> various_sentences = ['The President did not comment',
...                      'I lost the keys',
...                      'The team won the game',
...                      'Sara has two kids',
...                      'The ball went off the court',
...                      'They had the ball for the whole game',
...                      'The show is over']
>>> classifier = PositiveNaiveBayesClassifier(positive_set=sports_sentences,
...                                           unlabeled_set=various_sentences)
>>> classifier.classify("My team lost the game")
True
>>> classifier.classify("And now for something completely different.")
False
Parameters: - positive_set – A collection of strings that have the positive label.
- unlabeled_set – A collection of unlabeled strings.
- feature_extractor – A feature extractor function.
- positive_prob_prior – A prior estimate of the probability of the label True.
New in version 0.7.0.
-
accuracy
(test_set, format=None)¶ Compute the accuracy on a test set.
Parameters:
- test_set – A list of tuples of the form (text, label), or a file pointer.
- format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
-
classifier
¶ The classifier.
-
classify
(text)¶ Classifies the text.
Parameters: text (str) – A string of text.
-
extract_features
(text)¶ Extracts features from a body of text.
Return type: dictionary of features
-
labels
()¶ Return an iterable of possible labels.
-
train
(*args, **kwargs)[source]¶ Train the classifier with labeled and unlabeled feature sets and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling the classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.
Return type: A classifier
-
textblob.classifiers.basic_extractor(document, train_set)[source]¶ A basic document feature extractor that returns a dict indicating which words in train_set are contained in document.
Parameters:
- document – The text to extract features from. Can be a string or an iterable.
- train_set (list) – Training data set, a list of tuples of the form (words, label), or an iterable of strings.
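An illustrative sketch of the feature dict it produces. The contains(...) key convention follows the library's extractors; the exact set of keys depends on the training data:
>>> train = [('I love this sandwich.', 'pos'), ('I hate this.', 'neg')]
>>> features = basic_extractor('I love pizza', train)
>>> features['contains(love)']
True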
Blobber¶
-
class textblob.blob.Blobber(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source] A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.
Usage:
>>> from textblob import Blobber
>>> from textblob.taggers import NLTKTagger
>>> from textblob.tokenizers import SentenceTokenizer
>>> tb = Blobber(pos_tagger=NLTKTagger(), tokenizer=SentenceTokenizer())
>>> blob1 = tb("This is one blob.")
>>> blob2 = tb("This blob has the same tagger and tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True
Parameters:
- tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
- np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
- pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
- analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
- parser – (optional) A parser. If None, defaults to PatternParser.
- classifier – (optional) A classifier.
New in version 0.4.0.
-
__call__
(text)[source]¶ Return a new TextBlob object with this Blobber's np_extractor, pos_tagger, tokenizer, analyzer, and classifier.
Returns: A new TextBlob.
-
__init__
(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]¶ x.__init__(…) initializes x; see help(type(x)) for signature
-
__str__
()¶ x.__repr__() <==> repr(x)
File Formats¶
File formats for training and testing data.
Includes a registry of valid file formats. New file formats can be added to the registry like so:
from textblob import formats
class PipeDelimitedFormat(formats.DelimitedFormat):
delimiter = '|'
formats.register('psv', PipeDelimitedFormat)
Once a format has been registered, classifiers will be able to read data files with that format.
from textblob.classifiers import NaiveBayesClassifier

with open('training_data.psv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format='psv')
-
class textblob.formats.BaseFormat(fp, **kwargs)[source]¶ Interface for format classes. Individual formats can decide on the composition and meaning of **kwargs.
Parameters: fp (File) – A file-like object.
Changed in version 0.9.0: Constructor receives a file pointer rather than a file path.
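A minimal sketch of a custom format built directly on this interface. The SSVFormat name and its semicolon rule are illustrative assumptions, not library classes:
from textblob import formats

class SSVFormat(formats.BaseFormat):
    """Hypothetical semicolon-separated format: one 'text;label' pair per line."""
    def __init__(self, fp, **kwargs):
        super(SSVFormat, self).__init__(fp, **kwargs)
        self.data = [line.strip().rsplit(';', 1) for line in fp if line.strip()]

    def to_iterable(self):
        # Return an iterable of (text, label) tuples.
        return [(text, label) for text, label in self.data]

    @classmethod
    def detect(cls, stream):
        # Return True if the sample stream looks like this format.
        return all(';' in line for line in stream.splitlines() if line.strip())

formats.register('ssv', SSVFormat)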
-
class textblob.formats.CSV(fp, **kwargs)[source]¶ CSV format. Assumes each row is of the form text,label.
Today is a good day,pos
I hate this car.,neg
-
classmethod detect(stream)¶ Return True if stream is valid.
-
to_iterable
()¶ Return an iterable object from the data.
-
class textblob.formats.JSON(fp, **kwargs)[source]¶ JSON format. Assumes that JSON is formatted as an array of objects with text and label properties.
[
    {"text": "Today is a good day.", "label": "pos"},
    {"text": "I hate this car.", "label": "neg"}
]
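Once data is in this shape, a classifier can read it directly, mirroring the format parameter documented in the Classifiers section:
>>> from textblob.classifiers import NaiveBayesClassifier
>>> with open('train.json', 'r') as fp:
...     cl = NaiveBayesClassifier(fp, format="json")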
-
class textblob.formats.TSV(fp, **kwargs)[source]¶ TSV format. Assumes each row is of the form text<TAB>label.
-
classmethod detect(stream)¶ Return True if stream is valid.
-
to_iterable
()¶ Return an iterable object from the data.
Wordnet¶
Exceptions¶
-
exception textblob.exceptions.MissingCorpusError(message="\nLooks like you are missing some required data for this feature.\n\nTo download the necessary data, simply run\n\n    python -m textblob.download_corpora\n\nor use the NLTK downloader to download the missing data: http://nltk.org/data.html\nIf this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.\n", *args, **kwargs)[source]¶ Exception thrown when a user tries to use a feature that requires a dataset or model that the user does not have on their system.
-
exception textblob.exceptions.TranslatorError[source]¶ Raised when an error occurs during language translation or detection.