API Reference

Blob Classes

Wrappers for various units of text, including the main TextBlob, Word, and WordList classes. Example usage:

>>> from textblob import TextBlob
>>> b = TextBlob("Simple is better than complex.")
>>> b.tags
[('Simple', 'NN'), ('is', 'VBZ'), ('better', 'JJR'), ('than', 'IN'), ('complex', 'NN')]
>>> b.noun_phrases
WordList(['simple'])
>>> b.words
WordList(['Simple', 'is', 'better', 'than', 'complex'])
>>> b.sentiment
Sentiment(polarity=0.06666666666666667, subjectivity=0.41904761904761906)
>>> b.words[0].synsets[0]
Synset('simple.n.01')

Changed in version 0.8.0: These classes are now imported from textblob rather than text.blob.

class textblob.blob.BaseBlob(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]¶

An abstract base class that all textblob classes inherit from. Provides word, POS-tag, noun-phrase, and word-count properties, plus the basic dunder and string methods that make blobs behave like Python strings.

Parameters:

text – A string.
tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
parser – (optional) A parser instance. If None, defaults to PatternParser.
classifier – (optional) A classifier.

Changed in version 0.6.0: clean_html parameter deprecated, as it was in NLTK.

classify()[source]¶: Classify the blob using the blob’s classifier.

correct()[source]¶

Attempt to correct the spelling of a blob.

Added in version 0.6.0.

Return type:: BaseBlob

ends_with(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)¶: Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)¶: Like blob.find() but raise ValueError when the substring is not found.
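Since find() and index() are documented as mirroring the built-in string methods, the difference between them can be sketched with a plain str. The assumption here, not checked against a live blob, is that a blob reports the same indices for its underlying text.

```python
text = "Simple is better than complex."

# find() signals a missing substring with -1 ...
found_at = text.find("better")
not_found = text.find("worse")

# ... while index() raises ValueError instead.
try:
    text.index("worse")
except ValueError:
    index_raised = True
else:
    index_raised = False

# start/end restrict the search to the slice [start:end].
restricted = text.find("is", 8)
```

Prefer find() when a missing substring is an expected outcome, and index() when it indicates a bug.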

join(iterable)¶

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()¶: Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)[source]¶

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:: List of WordLists
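To make "tuples of n successive words" concrete, here is a minimal pure-Python sketch of the idea. The helper name sliding_ngrams is hypothetical and this is not TextBlob's implementation; it only illustrates the sliding-window behaviour described above on an already tokenized list.

```python
def sliding_ngrams(words, n=3):
    """Return every run of n successive words as a tuple."""
    if n <= 0:
        return []
    # The window can start at positions 0 .. len(words) - n.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["Simple", "is", "better", "than", "complex"]
trigrams = sliding_ngrams(words, n=3)
bigrams = sliding_ngrams(words, n=2)
too_long = sliding_ngrams(words, n=6)  # fewer words than n gives no n-grams
```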

noun_phrases¶: Returns a list of noun phrases for this blob.

np_counts¶: Dictionary of noun phrase frequencies in this text.

parse(parser=None)[source]¶

Parse the text.

Parameters:: parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

Added in version 0.6.0.

polarity¶

Return the polarity score as a float within the range [-1.0, 1.0].

Return type:: float

pos_tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

replace(old, new, count=9223372036854775807)¶: Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)¶: Like blob.rfind() but raise ValueError when substring is not found.

sentiment¶

Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0]; 0.0 is very objective and 1.0 is very subjective.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity)
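Because the return type is a namedtuple, the result can be read by attribute, by position, or by unpacking. The sketch below builds a stand-in Sentiment type with collections.namedtuple rather than importing TextBlob, and reuses the scores from the example at the top of this section; the stand-in is an assumption made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the documented return type.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

result = Sentiment(polarity=0.06666666666666667,
                   subjectivity=0.41904761904761906)

# Attribute access and positional access refer to the same fields.
same_field = result.polarity == result[0]

# A namedtuple also unpacks like a plain tuple.
polarity, subjectivity = result
in_range = -1.0 <= polarity <= 1.0 and 0.0 <= subjectivity <= 1.0
```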

sentiment_assessments¶

Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity, assessments)

split(sep=None, maxsplit=9223372036854775807)[source]¶

Behaves like the built-in str.split() except returns a WordList.

Return type:: WordList

starts_with(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

strip(chars=None)¶: Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity¶

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:: float

tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

title()¶: Returns a blob object with the text in title-case.

tokenize(tokenizer=None)[source]¶

Return a list of tokens, using tokenizer.

Parameters:: tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.

tokens¶: Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

upper()¶: Like str.upper(), returns new object with all upper-cased characters.

word_counts¶: Dictionary of word frequencies in this text.

words¶

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:: A WordList of word tokens.

class textblob.blob.Blobber(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]¶

A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.

Usage:

>>> from textblob import Blobber
>>> from textblob.taggers import NLTKTagger
>>> from textblob.tokenizers import SentenceTokenizer
>>> tb = Blobber(pos_tagger=NLTKTagger(), tokenizer=SentenceTokenizer())
>>> blob1 = tb("This is one blob.")
>>> blob2 = tb("This blob has the same tagger and tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True

Parameters:

tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
parser – (optional) A parser instance. If None, defaults to PatternParser.
classifier – (optional) A classifier.

Added in version 0.4.0.

class textblob.blob.Sentence(sentence, start_index=0, end_index=None, *args, **kwargs)[source]¶

A sentence within a TextBlob. Inherits from BaseBlob.

Parameters:

sentence – A string, the raw sentence.
start_index – An int, the index where this sentence begins in a TextBlob. If not given, defaults to 0.
end_index – An int, the index where this sentence ends in a TextBlob. If not given, defaults to the length of the sentence - 1.

classify()¶: Classify the blob using the blob’s classifier.

correct()¶

Attempt to correct the spelling of a blob.

Added in version 0.6.0.

Return type:: BaseBlob

property dict¶: The dict representation of this sentence.

end¶: The end index within a TextBlob.

end_index¶: The end index within a TextBlob.

ends_with(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)¶: Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)¶: Like blob.find() but raise ValueError when the substring is not found.

join(iterable)¶

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()¶: Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)¶

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:: List of WordLists

noun_phrases¶: Returns a list of noun phrases for this blob.

np_counts¶: Dictionary of noun phrase frequencies in this text.

parse(parser=None)¶

Parse the text.

Parameters:: parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

Added in version 0.6.0.

polarity¶

Return the polarity score as a float within the range [-1.0, 1.0].

Return type:: float

pos_tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

replace(old, new, count=9223372036854775807)¶: Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)¶: Like blob.rfind() but raise ValueError when substring is not found.

sentiment¶

Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0]; 0.0 is very objective and 1.0 is very subjective.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity)

sentiment_assessments¶

Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity, assessments)

split(sep=None, maxsplit=9223372036854775807)¶

Behaves like the built-in str.split() except returns a WordList.

Return type:: WordList

start¶: The start index within a TextBlob

start_index¶: The start index within a TextBlob

starts_with(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

strip(chars=None)¶: Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity¶

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:: float

tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

title()¶: Returns a blob object with the text in title-case.

tokenize(tokenizer=None)¶

Return a list of tokens, using tokenizer.

Parameters:: tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.

tokens¶: Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

upper()¶: Like str.upper(), returns new object with all upper-cased characters.

word_counts¶: Dictionary of word frequencies in this text.

words¶

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:: A WordList of word tokens.

class textblob.blob.TextBlob(text, tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None, clean_html=False)[source]¶

A general text block, meant for larger bodies of text (esp. those containing sentences). Inherits from BaseBlob.

Parameters:

text (str) – A string.
tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
classifier – (optional) A classifier.

classify()¶: Classify the blob using the blob’s classifier.

correct()¶

Attempt to correct the spelling of a blob.

Added in version 0.6.0.

Return type:: BaseBlob

ends_with(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)¶: Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)¶: Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)¶: Like blob.find() but raise ValueError when the substring is not found.

join(iterable)¶

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

property json¶: The json representation of this blob.

Changed in version 0.5.1: Made json a property instead of a method to restore backwards compatibility that was broken after version 0.4.0.

lower()¶: Like str.lower(), returns new object with all lower-cased characters.

ngrams(n=3)¶

Return a list of n-grams (tuples of n successive words) for this blob.

Return type:: List of WordLists

noun_phrases¶: Returns a list of noun phrases for this blob.

np_counts¶: Dictionary of noun phrase frequencies in this text.

parse(parser=None)¶

Parse the text.

Parameters:: parser – (optional) A parser instance. If None, defaults to this blob’s default parser.

Added in version 0.6.0.

polarity¶

Return the polarity score as a float within the range [-1.0, 1.0].

Return type:: float

pos_tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

property raw_sentences¶: List of strings, the raw sentences in the blob.

replace(old, new, count=9223372036854775807)¶: Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)¶: Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)¶: Like blob.rfind() but raise ValueError when substring is not found.

sentences¶: Return a list of Sentence objects.

sentiment¶

Return a tuple of the form (polarity, subjectivity), where polarity is a float within the range [-1.0, 1.0] and subjectivity is a float within the range [0.0, 1.0]; 0.0 is very objective and 1.0 is very subjective.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity)

sentiment_assessments¶

Return a tuple of the form (polarity, subjectivity, assessments), where polarity is a float within the range [-1.0, 1.0], subjectivity is a float within the range [0.0, 1.0] (0.0 is very objective and 1.0 is very subjective), and assessments is a list of polarity and subjectivity scores for the assessed tokens.

Return type:: namedtuple of the form Sentiment(polarity, subjectivity, assessments)

property serialized¶: Returns a list of each sentence’s dict representation.

split(sep=None, maxsplit=9223372036854775807)¶

Behaves like the built-in str.split() except returns a WordList.

Return type:: WordList

starts_with(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)¶: Returns True if the blob starts with the given prefix.

strip(chars=None)¶: Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

subjectivity¶

Return the subjectivity score as a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Return type:: float

tags¶

Returns a list of tuples of the form (word, POS tag).

Example:

[
    ("At", "IN"),
    ("eight", "CD"),
    ("o'clock", "JJ"),
    ("on", "IN"),
    ("Thursday", "NNP"),
    ("morning", "NN"),
]

Return type:: list of tuples

title()¶: Returns a blob object with the text in title-case.

to_json(*args, **kwargs)[source]¶: Return a json representation (str) of this blob. Takes the same arguments as json.dumps.

Added in version 0.5.1.

tokenize(tokenizer=None)¶

Return a list of tokens, using tokenizer.

Parameters:: tokenizer – (optional) A tokenizer object. If None, defaults to this blob’s default tokenizer.

tokens¶: Return a list of tokens, using this blob’s tokenizer object (defaults to WordTokenizer).

upper()¶: Like str.upper(), returns new object with all upper-cased characters.

word_counts¶: Dictionary of word frequencies in this text.

words¶

Return a list of word tokens. This excludes punctuation characters. If you want to include punctuation characters, access the tokens property.

Returns:: A WordList of word tokens.

class textblob.blob.Word(string, pos_tag=None)[source]¶

A simple word representation. Includes methods for inflection, and WordNet integration.

capitalize()¶

Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower case.

casefold()¶: Return a version of the string suitable for caseless comparisons.

center(width, fillchar=' ', /)¶

Return a centered string of length width.

Padding is done using the specified fill character (default is a space).

correct()[source]¶: Correct the spelling of the word. Returns the word with the highest confidence using the spelling corrector.

Added in version 0.6.0.

count(sub[, start[, end]]) → int¶: Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
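The "non-overlapping" wording is easy to misread, so here is a short illustration using a plain str, whose count() method Word inherits. The sample word is arbitrary.

```python
word = "banana"

# "ana" occurs at index 1 and again at index 3, but the two matches
# overlap, so only the first one is counted.
overlapping = word.count("ana")

# "an" occurs twice without overlap.
plain = word.count("an")

# start and end behave like slice bounds: this counts within word[0:3] == "ban".
sliced = word.count("a", 0, 3)
```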

define(pos=None)[source]¶

Return a list of definitions for this word. Each definition corresponds to a synset for this word.

Parameters:: pos – A part-of-speech tag to filter upon. If None, definitions for all parts of speech will be loaded.
Return type:: List of strings

Added in version 0.7.0.

definitions¶: The list of definitions for this word. Each definition corresponds to a synset.

Added in version 0.7.0.

encode(encoding='utf-8', errors='strict')¶

Encode the string using the codec registered for encoding.

encoding: The encoding in which to encode the string.
errors: The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
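The error handlers named above differ only in what they do with characters the target codec cannot represent. A small sketch with a plain str (the sample text is arbitrary) shows each one against the ASCII codec:

```python
text = "café"

utf8 = text.encode("utf-8")                       # every character is representable
ignored = text.encode("ascii", errors="ignore")   # drop what cannot be encoded
replaced = text.encode("ascii", errors="replace") # substitute a question mark
xml = text.encode("ascii", errors="xmlcharrefreplace")  # numeric character reference

# The default 'strict' handler raises instead of guessing.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False
```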

endswith(suffix[, start[, end]]) → bool¶: Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs(tabsize=8)¶

Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int¶

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → str¶: Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{’ and ‘}’).

format_map(mapping) → str¶: Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{’ and ‘}’).

get_synsets(pos=None)[source]¶

Return a list of Synset objects for this word.

Parameters:: pos – A part-of-speech tag to filter upon. If None, all synsets for all parts of speech will be loaded.
Return type:: list of Synsets

Added in version 0.7.0.

index(sub[, start[, end]]) → int¶

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

isalnum()¶

Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.

isalpha()¶

Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.

isascii()¶

Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.

isdecimal()¶

Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.

isdigit()¶

Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there is at least one character in the string.

isidentifier()¶

Return True if the string is a valid Python identifier, False otherwise.

Call keyword.iskeyword(s) to test whether string s is a reserved identifier, such as “def” or “class”.

islower()¶

Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.

isnumeric()¶

Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at least one character in the string.

isprintable()¶

Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in repr() or if it is empty.

isspace()¶

Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.

istitle()¶

Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.

isupper()¶

Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.

join(iterable, /)¶

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

lemma¶: Return the lemma of this word using WordNet’s morphy function.

lemmatize(pos=None)[source]¶

Return the lemma for a word using WordNet’s morphy function.

Parameters:: pos – Part of speech to filter upon. If None, defaults to _wordnet.NOUN.

Added in version 0.8.1.

ljust(width, fillchar=' ', /)¶

Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).

lower()¶: Return a copy of the string converted to lowercase.

lstrip(chars=None, /)¶

Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.

static maketrans()¶

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
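The three calling conventions described above can be sketched as follows, using the built-in str.maketrans() together with str.translate(). The sample strings are arbitrary.

```python
# One argument: a dict mapping characters (or ordinals) to replacements.
# Mapping a character to None deletes it.
one_arg = str.maketrans({"a": "4", "e": None})
leet = "beta".translate(one_arg)

# Two arguments: equal-length strings, mapped position by position.
two_args = str.maketrans("abc", "xyz")
shifted = "aabbcc".translate(two_args)

# Three arguments: characters in the third string are deleted.
three_args = str.maketrans("ab", "xy", "!")
cleaned = "ab!ba!".translate(three_args)

# Character keys are converted to ordinals in the resulting table.
keys_are_ordinals = all(isinstance(k, int) for k in one_arg)
```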

partition(sep, /)¶

Partition the string into three parts using the given separator.

This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string and two empty strings.

pluralize()[source]¶: Return the plural version of the word as a string.

removeprefix(prefix, /)¶

Return a str with the given prefix string removed if present.

If the string starts with the prefix string, return string[len(prefix):]. Otherwise, return a copy of the original string.

removesuffix(suffix, /)¶

Return a str with the given suffix string removed if present.

If the string ends with the suffix string and that suffix is not empty, return string[:-len(suffix)]. Otherwise, return a copy of the original string.

replace(old, new, count=-1, /)¶

Return a copy with all occurrences of substring old replaced by new.

count
Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int¶

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int¶

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

rjust(width, fillchar=' ', /)¶

Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).

rpartition(sep, /)¶

Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings and the original string.
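partition() and rpartition() differ in which occurrence of the separator they split on and in where the original string lands when the separator is absent. A brief sketch with a plain str (sample values are arbitrary):

```python
setting = "key=a=b"

first = setting.partition("=")    # splits at the first "="
last = setting.rpartition("=")    # splits at the last "="

# With no separator present, the original string ends up at opposite ends.
missing_left = "plain".partition("=")
missing_right = "plain".rpartition("=")
```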

rsplit(sep=None, maxsplit=-1)¶

Return a list of the substrings in the string, using sep as the separator string.

sep
The separator used to split the string.

When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result.

maxsplit
Maximum number of splits. -1 (the default value) means no limit.

Splitting starts at the end of the string and works to the front.

rstrip(chars=None, /)¶

Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

singularize()[source]¶: Return the singular version of the word as a string.

spellcheck()[source]¶

Return a list of (word, confidence) tuples of spelling corrections.

Based on: Peter Norvig, “How to Write a Spelling Corrector” (http://norvig.com/spell-correct.html) as implemented in the pattern library.

Added in version 0.6.0.

split(sep=None, maxsplit=-1)¶

Return a list of the substrings in the string, using sep as the separator string.

sep
The separator used to split the string.

When set to None (the default value), will split on any whitespace character (including \n \r \t \f and spaces) and will discard empty strings from the result.

maxsplit
Maximum number of splits. -1 (the default value) means no limit.

Splitting starts at the front of the string and works to the end.

Note, str.split() is mainly useful for data that has been intentionally delimited. With natural text that includes punctuation, consider using the regular expression module.
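The following sketch contrasts split() with rsplit() under a maxsplit limit and shows one way to follow the note about natural text. The regular expression is only an illustrative choice for pulling out word-like runs, not a recommended tokenizer.

```python
import re

csv_like = "a,b,c"
from_front = csv_like.split(",", 1)   # at most one split, starting at the front
from_back = csv_like.rsplit(",", 1)   # at most one split, starting at the end

# sep=None splits on runs of whitespace and discards empty strings.
on_whitespace = "  a \t b\n".split()

# For punctuated prose, a regular expression avoids leftover punctuation.
sentence = "Hello, world! It's fine."
naive = sentence.split()
with_regex = re.findall(r"[\w']+", sentence)
```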

splitlines(keepends=False)¶

Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and true.
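A short illustration of the keepends flag, using a plain str with mixed line endings:

```python
text = "one\ntwo\r\nthree"

without_breaks = text.splitlines()
with_breaks = text.splitlines(keepends=True)

# Joining the keepends=True result reconstructs the original string.
round_trip = "".join(with_breaks) == text
```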

startswith(prefix[, start[, end]]) → bool¶: Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

stem(stemmer=<PorterStemmer>)[source]¶: Stem a word using various NLTK stemmers. (Default: Porter Stemmer)

Added in version 0.12.0.

strip(chars=None, /)¶

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

swapcase()¶: Convert uppercase characters to lowercase and lowercase characters to uppercase.

synsets¶

The list of Synset objects for this Word.

Return type:: list of Synsets

Added in version 0.7.0.

title()¶

Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining cased characters have lower case.

translate(table, /)¶

Replace each character in the string using the given translation table.

table
Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.

upper()¶: Return a copy of the string converted to uppercase.

zfill(width, /)¶

Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.
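For instance, with plain strings (the values are arbitrary), zfill() pads after a leading sign and leaves strings that are already wide enough untouched:

```python
padded = "42".zfill(5)
signed = "-42".zfill(5)     # zeros are inserted after the sign
unchanged = "12345".zfill(3)  # already wider than the field, so not truncated
```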

class textblob.blob.WordList(collection)[source]¶

A list-like collection of words.

append(obj)[source]¶: Append an object to the end. If the object is a string, appends a Word object.

clear()¶: Remove all items from list.

copy()¶: Return a shallow copy of the list.

count(strg, case_sensitive=False, *args, **kwargs)[source]¶

Get the count of a word or phrase strg within this WordList.

Parameters:

strg – The string to count.
case_sensitive – A boolean, whether or not the search is case-sensitive.

extend(iterable)[source]¶: Extend WordList by appending elements from iterable. If an element is a string, appends a Word object.

index(value, start=0, stop=9223372036854775807, /)¶

Return first index of value.

Raises ValueError if the value is not present.

insert(index, object, /)¶: Insert object before index.

lemmatize()[source]¶: Return the lemma of each word in this WordList.

lower()[source]¶: Return a new WordList with each word lower-cased.

pluralize()[source]¶: Return the plural version of each word in this WordList.

pop(index=-1, /)¶

Remove and return item at index (default last).

Raises IndexError if list is empty or index is out of range.

remove(value, /)¶

Remove first occurrence of value.

Raises ValueError if the value is not present.

reverse()¶: Reverse IN PLACE.

singularize()[source]¶: Return the singular version of each word in this WordList.

sort(*, key=None, reverse=False)¶

Sort the list in ascending order and return None.

The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).

If a key function is given, apply it once to each list item and sort them, ascending or descending, according to their function values.

The reverse flag can be set to sort in descending order.
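The in-place and stability guarantees can be seen on a plain list. This sketch assumes WordList inherits the same behavior from list; the sample words are arbitrary.

```python
words = ["pear", "Fig", "apple", "kiwi", "date"]

# sort() modifies the list and returns None.
returned = words.sort(key=len)
ascending = list(words)

# Stability: the three 4-letter words keep their original relative order,
# also when sorting in descending order.
words.sort(key=len, reverse=True)
descending = list(words)

print(ascending)   # ['Fig', 'pear', 'kiwi', 'date', 'apple']
print(descending)  # ['apple', 'pear', 'kiwi', 'date', 'Fig']
```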

stem(*args, **kwargs)[source]¶: Return the stem for each word in this WordList.

upper()[source]¶: Return a new WordList with each word upper-cased.

Base Classes¶

Abstract base classes for models (taggers, noun phrase extractors, etc.) which define the interface for descendant classes.

Changed in version 0.7.0: All base classes are defined in the same module, textblob.base.

class textblob.base.BaseNPExtractor[source]¶

Abstract base class from which all NPExtractor classes inherit. Descendant classes must implement an extract(text) method that returns a list of noun phrases as strings.

abstractmethod extract(text: str) → list[str][source]¶: Return a list of noun phrases (strings) for a body of text.

class textblob.base.BaseParser[source]¶

Abstract parser class from which all parsers inherit. All descendants must implement a parse() method.

abstractmethod parse(text: AnyStr)[source]¶: Parses the text.

class textblob.base.BaseSentimentAnalyzer[source]¶

Abstract base class from which all sentiment analyzers inherit. Descendant classes should implement an analyze(text) method which returns the results of analysis.

abstractmethod analyze(text) → Any[source]¶: Return the result of analysis. Typically returns a tuple, float, or dictionary.

class textblob.base.BaseTagger[source]¶

Abstract tagger class from which all taggers inherit. All descendants must implement a tag() method.

abstractmethod tag(text: str, tokenize=True) → list[tuple[str, str]][source]¶: Return a list of tuples of the form (word, tag) for a given set of text or BaseBlob instance.

class textblob.base.BaseTokenizer[source]¶

Abstract base class from which all Tokenizer classes inherit. Descendant classes must implement a tokenize(text) method that returns a list of tokens as strings.

itokenize(text: str, *args, **kwargs)[source]¶

Return a generator that generates tokens “on-demand”.

Added in version 0.6.0.

Return type:: generator

abstractmethod tokenize(text: str) → list[str][source]¶

Return a list of tokens (strings) for a body of text.

Return type:: list
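The tokenizer interface described above (an abstract tokenize() plus a generator-returning itokenize()) can be sketched with the standard abc module. SketchTokenizer and WhitespaceTokenizer are hypothetical stand-ins, not the library classes, and the way itokenize() delegates to tokenize() is an assumption made for illustration.

```python
from abc import ABC, abstractmethod


class SketchTokenizer(ABC):
    """Stand-in for the documented interface (not textblob.base.BaseTokenizer)."""

    @abstractmethod
    def tokenize(self, text):
        """Return a list of tokens (strings) for a body of text."""

    def itokenize(self, text, *args, **kwargs):
        # Wrap the token list in a generator so callers can consume
        # tokens one at a time.
        return (token for token in self.tokenize(text, *args, **kwargs))


class WhitespaceTokenizer(SketchTokenizer):
    """Hypothetical concrete tokenizer that splits on whitespace."""

    def tokenize(self, text):
        return text.split()


tokens = WhitespaceTokenizer().itokenize("Simple is better than complex.")
first = next(tokens)
rest = list(tokens)
print(first, rest)
```

Because tokenize() is abstract, instantiating SketchTokenizer directly raises TypeError; only subclasses that implement it can be used.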

Tokenizers¶

Various tokenizer implementations.

Added in version 0.4.0.

class textblob.tokenizers.SentenceTokenizer[source]¶

NLTK’s sentence tokenizer (currently PunktSentenceTokenizer). Uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences, then uses that to find sentence boundaries.

itokenize(text: str, *args, **kwargs)¶

Return a generator that generates tokens “on-demand”.

Added in version 0.6.0.

Return type:: generator

span_tokenize(s: str) → Iterator[Tuple[int, int]]¶

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:: Iterator[Tuple[int, int]]
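The offset contract (s[start_i:end_i] is the token) can be illustrated with a small standalone generator. This sketch uses whitespace-delimited tokens and the standard re module rather than the NLTK tokenizer; whitespace_spans is a hypothetical name.

```python
import re


def whitespace_spans(s):
    """Yield (start, end) offsets of whitespace-delimited tokens in s."""
    for match in re.finditer(r"\S+", s):
        yield match.span()


text = "Simple is better."
spans = list(whitespace_spans(text))
# Slicing the original string with each span recovers the token.
tokens = [text[start:end] for start, end in spans]
print(spans)   # [(0, 6), (7, 9), (10, 17)]
print(tokens)  # ['Simple', 'is', 'better.']
```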

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]¶

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Yield:: List[Tuple[int, int]]

tokenize(text)[source]¶: Return a list of sentences.

tokenize_sents(strings: List[str]) → List[List[str]]¶

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type:: List[List[str]]

class textblob.tokenizers.WordTokenizer[source]¶

NLTK’s recommended word tokenizer (currently the TreeBankTokenizer). Uses regular expressions to tokenize text. Assumes text has already been segmented into sentences.

Performs the following steps:

split standard contractions, e.g. don’t -> do n’t
split commas and single quotes
separate periods that appear at the end of line

itokenize(text: str, *args, **kwargs)¶

Return a generator that generates tokens “on-demand”.

Added in version 0.6.0.

Return type:: generator

span_tokenize(s: str) → Iterator[Tuple[int, int]]¶

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type:: Iterator[Tuple[int, int]]

span_tokenize_sents(strings: List[str]) → Iterator[List[Tuple[int, int]]]¶

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]

Yield:: List[Tuple[int, int]]

tokenize(text, include_punc=True)[source]¶

Return a list of word tokens.

Parameters:

text – string of text.
include_punc – (optional) whether to include punctuation as separate tokens. Defaults to True.

tokenize_sents(strings: List[str]) → List[List[str]]¶

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]

Return type:: List[List[str]]

textblob.tokenizers.sent_tokenize(text: str, *args, **kwargs)¶: Convenience function for tokenizing sentences.

textblob.tokenizers.word_tokenize(text, include_punc=True, *args, **kwargs)[source]¶

Convenience function for tokenizing text into words.

NOTE: NLTK’s word tokenizer expects sentences as input, so the text will be tokenized to sentences before being tokenized to words.
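The two-stage flow in the note above (sentences first, then words) can be sketched with the standard library alone. The regular expressions below are rough stand-ins for the NLTK tokenizers, and both function names are hypothetical.

```python
import re
from itertools import chain


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(text, include_punc=True):
    """Tokenize sentence by sentence, then flatten into one token stream."""
    pattern = r"\w+|[^\w\s]" if include_punc else r"\w+"
    return chain.from_iterable(
        re.findall(pattern, sentence) for sentence in naive_sent_tokenize(text)
    )


text = "The beer is good. But the hangover is horrible."
with_punc = list(naive_word_tokenize(text))
without_punc = list(naive_word_tokenize(text, include_punc=False))
print(with_punc)
print(without_punc)
```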

POS Taggers¶

Parts-of-speech tagger implementations.

class textblob.en.taggers.NLTKTagger[source]¶

Tagger that uses NLTK’s standard TreeBank tagger. NOTE: Requires numpy. Not yet supported with PyPy.

tag(text)[source]¶: Tag a string or BaseBlob.

class textblob.en.taggers.PatternTagger[source]¶

Tagger that uses the implementation in Tom de Smedt’s pattern library (http://www.clips.ua.ac.be/pattern).

tag(text, tokenize=True)[source]¶: Tag a string or BaseBlob.

Noun Phrase Extractors¶

Various noun phrase extractors.

class textblob.en.np_extractors.ChunkParser[source]¶

accuracy(gold)¶

Score the accuracy of the chunker against the gold standard. Remove the chunking from the gold standard text, rechunk it using the chunker, and return a ChunkScore object reflecting the performance of this chunk parser.

Parameters:: gold (list(Tree)) – The list of chunked sentences to score the chunker on.
Return type:: ChunkScore

evaluate(**kwargs)¶: Deprecated: use accuracy(gold) instead.

grammar()¶

Returns:: The grammar used by this parser.

parse(tokens)[source]¶: Return the parse tree for the sentence.

parse_all(sent, *args, **kwargs)¶

Return type:: list(Tree)

parse_one(sent, *args, **kwargs)¶

Return type:: Tree or None

parse_sents(sents, *args, **kwargs)¶

Apply self.parse() to each element of sents.

Return type:: iter(iter(Tree))

train()[source]¶: Train the Chunker on the CoNLL-2000 corpus.

class textblob.en.np_extractors.ConllExtractor(parser=None)[source]¶

A noun phrase extractor that uses chunk parsing trained with the CoNLL-2000 training corpus.

extract(text)[source]¶: Return a list of noun phrases (strings) for a body of text.

class textblob.en.np_extractors.FastNPExtractor[source]¶

A fast and simple noun phrase extractor.

Credit to Shlomi Babluk. Link to original blog post:

http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/

extract(text)[source]¶: Return a list of noun phrases (strings) for a body of text.

Sentiment Analyzers¶

Sentiment analysis implementations.

Added in version 0.5.0.

class textblob.en.sentiments.NaiveBayesAnalyzer(feature_extractor=<function _default_feature_extractor>)[source]¶

Naive Bayes analyzer that is trained on a dataset of movie reviews. Returns results as a named tuple of the form: Sentiment(classification, p_pos, p_neg)

Parameters:: feature_extractor (callable) – Function that returns a dictionary of features, given a list of words.
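A feature extractor of the shape described above (a list of words in, a dictionary of features out) might look like the following sketch. The function name and the choice to lower-case words are assumptions made for illustration, not the library's default extractor.

```python
def lowercase_presence_features(words):
    """Map each lower-cased word to True (hypothetical extractor)."""
    return {word.lower(): True for word in words}


features = lowercase_presence_features(["Great", "movie", "great", "cast"])
print(features)  # {'great': True, 'movie': True, 'cast': True}
```

A callable like this could then be passed as the feature_extractor argument when constructing the analyzer.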

RETURN_TYPE¶

Return type declaration

alias of Sentiment

analyze(text)[source]¶: Return the sentiment as a named tuple of the form: Sentiment(classification, p_pos, p_neg)

train()[source]¶: Train the Naive Bayes classifier on the movie review corpus.

class textblob.en.sentiments.PatternAnalyzer[source]¶

Sentiment analyzer that uses the same implementation as the pattern library. Returns results as a named tuple of the form:

Sentiment(polarity, subjectivity, [assessments])

where [assessments] is a list of the assessed tokens and their polarity and subjectivity scores.

RETURN_TYPE¶: alias of Sentiment

analyze(text, keep_assessments=False)[source]¶: Return the sentiment as a named tuple of the form: Sentiment(polarity, subjectivity, [assessments]).

Parsers¶

Various parser implementations.

Added in version 0.6.0.

class textblob.en.parsers.PatternParser[source]¶

Parser that uses the implementation in Tom de Smedt’s pattern library. http://www.clips.ua.ac.be/pages/pattern-en#parser

parse(text)[source]¶: Parses the text.

Classifiers¶

Various classifier implementations. Also includes basic feature extractor methods.

Example Usage:

>>> from textblob import TextBlob
>>> from textblob.classifiers import NaiveBayesClassifier
>>> train = [
...     ('I love this sandwich.', 'pos'),
...     ('This is an amazing place!', 'pos'),
...     ('I feel very good about these beers.', 'pos'),
...     ('I do not like this restaurant', 'neg'),
...     ('I am tired of this stuff.', 'neg'),
...     ("I can't deal with this", 'neg'),
...     ("My boss is horrible.", "neg")
... ]
>>> cl = NaiveBayesClassifier(train)
>>> cl.classify("I feel amazing!")
'pos'
>>> blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
>>> for s in blob.sentences:
...     print(s)
...     print(s.classify())
...
The beer is good.
pos
But the hangover is horrible.
neg

Added in version 0.6.0.

class textblob.classifiers.BaseClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶

Abstract classifier class from which all classifiers inherit. At a minimum, descendant classes must implement a classify method and have a classifier property.

Parameters:

train_set – The training set, either a list of tuples of the form (text, classification) or a file-like object. text may be either a string or an iterable.
feature_extractor (callable) – A feature extractor function that takes one or two arguments: document and train_set.
format (str) – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.
kwargs – Additional keyword arguments are passed to the constructor of the Format class used to read the data. Only applies when a file-like object is passed as train_set.

Added in version 0.6.0.

classifier¶: The classifier object.

classify(text)[source]¶: Classifies a string of text.

extract_features(text)[source]¶

Extracts features from a body of text.

Return type:: dictionary of features

labels()[source]¶: Returns an iterable containing the possible labels.

train(labeled_featureset)[source]¶: Trains the classifier.

class textblob.classifiers.DecisionTreeClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶

A classifier based on the decision tree algorithm, as implemented in NLTK.

Parameters:

train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

Added in version 0.6.2.

accuracy(test_set, format=None)¶

Compute the accuracy on a test set.

Parameters:

test_set – A list of tuples of the form (text, label), or a file pointer.
format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

classifier¶: The classifier.

classify(text)¶

Classifies the text.

Parameters:: text (str) – A string of text.

extract_features(text)¶

Extracts features from a body of text.

Return type:: dictionary of features

labels()¶: Return an iterable of possible labels.

nltk_class¶: alias of DecisionTreeClassifier

pprint(*args, **kwargs)¶

Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.

Return type:: str

pretty_format(*args, **kwargs)[source]¶

Return a string containing a pretty-printed version of this decision tree. Each line in the string corresponds to a single decision tree node or leaf, and indentation is used to display the structure of the tree.

Return type:: str

pseudocode(*args, **kwargs)[source]¶

Return a string representation of this decision tree that expresses the decisions it makes as a nested set of pseudocode if statements.

Return type:: str

train(*args, **kwargs)¶

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Added in version 0.6.2.

Return type:: A classifier

update(new_data, *args, **kwargs)¶

Update the classifier with new training data and re-train the classifier.

Parameters:: new_data – New data as a list of tuples of the form (text, label).

class textblob.classifiers.MaxEntClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶

A maximum entropy classifier (also known as a “conditional exponential classifier”). This classifier is parameterized by a set of “weights”, which are used to combine the joint-features that are generated from a featureset by an “encoding”. In particular, the encoding maps each (featureset, label) pair to a vector. The probability of each label is then computed using the following equation:

                          dotprod(weights, encode(fs,label))
prob(fs|label) = ---------------------------------------------------
                 sum(dotprod(weights, encode(fs,l)) for l in labels)

Where dotprod is the dot product:

dotprod(a,b) = sum(x*y for (x,y) in zip(a,b))
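The equation can be transcribed directly into Python. This is a sketch that follows the formula exactly as written above; the weights, the labels, and the encode function are invented for the example and do not come from a trained model or from NLTK.

```python
def dotprod(a, b):
    """Dot product, as defined in the text."""
    return sum(x * y for (x, y) in zip(a, b))


def encode(fs, label):
    """Hypothetical encoding: one vector slot per (feature, label) pair."""
    has_good = 1.0 if fs.get("good") else 0.0
    return [
        has_good if label == "pos" else 0.0,  # slot for ("good", "pos")
        has_good if label == "neg" else 0.0,  # slot for ("good", "neg")
    ]


def prob(fs, label, weights, labels):
    """Score for one label, normalized by the scores of all labels."""
    numerator = dotprod(weights, encode(fs, label))
    denominator = sum(dotprod(weights, encode(fs, l)) for l in labels)
    return numerator / denominator


labels = ["pos", "neg"]
weights = [3.0, 1.0]   # made-up weights
fs = {"good": True}    # made-up featureset
probs = {label: prob(fs, label, weights, labels) for label in labels}
print(probs)  # {'pos': 0.75, 'neg': 0.25}
```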

accuracy(test_set, format=None)¶

Compute the accuracy on a test set.

Parameters:

test_set – A list of tuples of the form (text, label), or a file pointer.
format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

classifier¶: The classifier.

classify(text)¶

Classifies the text.

Parameters:: text (str) – A string of text.

extract_features(text)¶

Extracts features from a body of text.

Return type:: dictionary of features

labels()¶: Return an iterable of possible labels.

nltk_class¶: alias of MaxentClassifier

prob_classify(text)[source]¶

Return the label probability distribution for classifying a string of text.

Example:

>>> classifier = MaxEntClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7

Return type:: nltk.probability.DictionaryProbDist

train(*args, **kwargs)¶

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Added in version 0.6.2.

Return type:: A classifier

update(new_data, *args, **kwargs)¶

Update the classifier with new training data and re-train the classifier.

Parameters:: new_data – New data as a list of tuples of the form (text, label).

class textblob.classifiers.NLTKClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶

An abstract class that wraps around the nltk.classify module.

Expects that descendant classes include a class variable nltk_class which is the class in the nltk.classify module to be wrapped.

Example:

class MyClassifier(NLTKClassifier):
    nltk_class = nltk.classify.svm.SvmClassifier

accuracy(test_set, format=None)[source]¶

Compute the accuracy on a test set.

Parameters:

test_set – A list of tuples of the form (text, label), or a file pointer.
format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

classifier¶: The classifier.

classify(text)[source]¶

Classifies the text.

Parameters:: text (str) – A string of text.

extract_features(text)¶

Extracts features from a body of text.

Return type:: dictionary of features

labels()[source]¶: Return an iterable of possible labels.

nltk_class = None¶: The NLTK class to be wrapped. Must be a class within nltk.classify

train(*args, **kwargs)[source]¶

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Added in version 0.6.2.

Return type:: A classifier
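The "implicitly called" behavior described for train can be pictured as train-on-first-use. The class below is a hypothetical stand-in with a trivial majority-label "model"; it illustrates the pattern only and makes no claim about how the library or NLTK implement it.

```python
class LazyTrainingSketch:
    """Hypothetical classifier that trains the first time it is needed."""

    def __init__(self, train_set):
        self.train_set = train_set
        self._model = None
        self.train_calls = 0

    def train(self):
        """Build the toy model: the most frequent label in the training set."""
        self.train_calls += 1
        labels = [label for _text, label in self.train_set]
        self._model = max(set(labels), key=labels.count)
        return self._model

    @property
    def classifier(self):
        # Train only if no model has been built yet.
        if self._model is None:
            self.train()
        return self._model

    def classify(self, text):
        # The toy model ignores the text and returns the majority label.
        return self.classifier


cl = LazyTrainingSketch([("a", "pos"), ("b", "pos"), ("c", "neg")])
calls_before = cl.train_calls
first = cl.classify("anything")
second = cl.classify("anything else")
print(calls_before, first, second, cl.train_calls)  # 0 pos pos 1
```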

update(new_data, *args, **kwargs)[source]¶

Update the classifier with new training data and re-train the classifier.

Parameters:: new_data – New data as a list of tuples of the form (text, label).

class textblob.classifiers.NaiveBayesClassifier(train_set, feature_extractor=<function basic_extractor>, format=None, **kwargs)[source]¶

A classifier based on the Naive Bayes algorithm, as implemented in NLTK.

Parameters:

train_set – The training set, either a list of tuples of the form (text, classification) or a filename. text may be either a string or an iterable.
feature_extractor – A feature extractor function that takes one or two arguments: document and train_set.
format – If train_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

Added in version 0.6.0.

accuracy(test_set, format=None)¶

Compute the accuracy on a test set.

Parameters:

test_set – A list of tuples of the form (text, label), or a file pointer.
format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

classifier¶: The classifier.

classify(text)¶

Classifies the text.

Parameters:: text (str) – A string of text.

extract_features(text)¶

Extracts features from a body of text.

Return type:: dictionary of features

informative_features(*args, **kwargs)[source]¶

Return the most informative features as a list of tuples of the form (feature_name, feature_value).

Return type:: list

labels()¶: Return an iterable of possible labels.

nltk_class¶: alias of NaiveBayesClassifier

prob_classify(text)[source]¶

Return the label probability distribution for classifying a string of text.

Example:

>>> classifier = NaiveBayesClassifier(train_data)
>>> prob_dist = classifier.prob_classify("I feel happy this morning.")
>>> prob_dist.max()
'positive'
>>> prob_dist.prob("positive")
0.7

Return type:: nltk.probability.DictionaryProbDist

show_informative_features(*args, **kwargs)[source]¶

Displays a listing of the most informative features for this classifier.

Return type:: None

train(*args, **kwargs)¶

Train the classifier with a labeled feature set and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Added in version 0.6.2.

Return type:: A classifier

update(new_data, *args, **kwargs)¶

Update the classifier with new training data and re-train the classifier.

Parameters:: new_data – New data as a list of tuples of the form (text, label).

class textblob.classifiers.PositiveNaiveBayesClassifier(positive_set, unlabeled_set, feature_extractor=<function contains_extractor>, positive_prob_prior=0.5, **kwargs)[source]¶

A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets, i.e. when only one class is labeled and the other is not. Assuming a prior distribution on the two labels, uses the unlabeled set to estimate the frequencies of the features.

Example usage:

>>> from textblob.classifiers import PositiveNaiveBayesClassifier
>>> sports_sentences = ['The team dominated the game',
...                   'They lost the ball',
...                   'The game was intense',
...                   'The goalkeeper catched the ball',
...                   'The other team controlled the ball']
>>> various_sentences = ['The President did not comment',
...                        'I lost the keys',
...                        'The team won the game',
...                        'Sara has two kids',
...                        'The ball went off the court',
...                        'They had the ball for the whole game',
...                        'The show is over']
>>> classifier = PositiveNaiveBayesClassifier(positive_set=sports_sentences,
...                                           unlabeled_set=various_sentences)
>>> classifier.classify("My team lost the game")
True
>>> classifier.classify("And now for something completely different.")
False

Parameters:

positive_set – A collection of strings that have the positive label.
unlabeled_set – A collection of unlabeled strings.
feature_extractor – A feature extractor function.
positive_prob_prior – A prior estimate of the probability of the label True.

Added in version 0.7.0.

accuracy(test_set, format=None)¶

Compute the accuracy on a test set.

Parameters:

test_set – A list of tuples of the form (text, label), or a file pointer.
format – If test_set is a filename, the file format, e.g. "csv" or "json". If None, will attempt to detect the file format.

classifier¶: The classifier.

classify(text)¶

Classifies the text.

Parameters:: text (str) – A string of text.

extract_features(text)¶

Extracts features from a body of text.

Return type:: dictionary of features

labels()¶: Return an iterable of possible labels.

nltk_class¶: alias of PositiveNaiveBayesClassifier

train(*args, **kwargs)[source]¶

Train the classifier with labeled and unlabeled feature sets and return the classifier. Takes the same arguments as the wrapped NLTK class. This method is implicitly called when calling classify or accuracy methods and is included only to allow passing in arguments to the train method of the wrapped NLTK class.

Return type:: A classifier

update(new_positive_data=None, new_unlabeled_data=None, positive_prob_prior=0.5, *args, **kwargs)[source]¶

Update the classifier with new data and re-train the classifier.

Parameters:

new_positive_data – List of new, labeled strings.
new_unlabeled_data – List of new, unlabeled strings.

textblob.classifiers.basic_extractor(document, train_set)[source]¶

A basic document feature extractor that returns a dict indicating what words in train_set are contained in document.

Parameters:

document – The text to extract features from. Can be a string or an iterable.
train_set (list) – Training data set, a list of tuples of the form (words, label) OR an iterable of strings.

textblob.classifiers.contains_extractor(document)[source]¶: A basic document feature extractor that returns a dict of words that the document contains.
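The two extractor shapes described above (one argument: document; two arguments: document and train_set) can be sketched as follows. Both functions are hypothetical stand-ins that tokenize on whitespace, and the "contains(word)" key format is an assumption made for illustration.

```python
def contains_features(document):
    """Mark every word in the document as present."""
    return {f"contains({word})": True for word in set(document.split())}


def vocabulary_features(document, train_set):
    """Flag which words from the training set occur in the document."""
    vocabulary = {word for text, _label in train_set for word in text.split()}
    tokens = set(document.split())
    return {f"contains({word})": word in tokens for word in vocabulary}


train = [("I love this sandwich", "pos"), ("I hate this car", "neg")]
one_arg = contains_features("I love")
two_arg = vocabulary_features("I love this car", train)
print(one_arg)
print(two_arg)
```

The one-argument form only records what the document contains, while the two-argument form also records which training-set words are absent.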

Blobber¶

class textblob.blob.Blobber(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]¶

A factory for TextBlobs that all share the same tagger, tokenizer, parser, classifier, and np_extractor.

Usage:

>>> from textblob import Blobber
>>> from textblob.taggers import NLTKTagger
>>> from textblob.tokenizers import SentenceTokenizer
>>> tb = Blobber(pos_tagger=NLTKTagger(), tokenizer=SentenceTokenizer())
>>> blob1 = tb("This is one blob.")
>>> blob2 = tb("This blob has the same tagger and tokenizer.")
>>> blob1.pos_tagger is blob2.pos_tagger
True

Parameters:

tokenizer – (optional) A tokenizer instance. If None, defaults to WordTokenizer().
np_extractor – (optional) An NPExtractor instance. If None, defaults to FastNPExtractor().
pos_tagger – (optional) A Tagger instance. If None, defaults to NLTKTagger.
analyzer – (optional) A sentiment analyzer. If None, defaults to PatternAnalyzer.
parser – A parser. If None, defaults to PatternParser.
classifier – A classifier.

Added in version 0.4.0.

__call__(text)[source]¶

Return a new TextBlob object with this Blobber’s np_extractor, pos_tagger, tokenizer, analyzer, and classifier.

Returns:: A new TextBlob.

__init__(tokenizer=None, pos_tagger=None, np_extractor=None, analyzer=None, parser=None, classifier=None)[source]¶

__repr__()[source]¶: Return repr(self).

__str__()¶: Return str(self).

File Formats¶

File formats for training and testing data.

Includes a registry of valid file formats. New file formats can be added to the registry like so:

from textblob import formats

class PipeDelimitedFormat(formats.DelimitedFormat):
    delimiter = "|"

formats.register("psv", PipeDelimitedFormat)

Once a format has been registered, classifiers will be able to read data files with that format.

from textblob.classifiers import NaiveBayesClassifier

with open("training_data.psv", "r") as fp:
    cl = NaiveBayesClassifier(fp, format="psv")

class textblob.formats.BaseFormat(fp, **kwargs)[source]¶

Interface for format classes. Individual formats can decide on the composition and meaning of **kwargs.

Parameters:: fp (File) – A file-like object.

Changed in version 0.9.0: Constructor receives a file pointer rather than a file path.

classmethod detect(stream: str)[source]¶: Detect the file format given a filename. Return True if a stream is this file format.

Changed in version 0.9.0: Changed from a static method to a class method.

to_iterable()[source]¶: Return an iterable object from the data.

class textblob.formats.CSV(fp, **kwargs)[source]¶

CSV format. Assumes each row is of the form text,label.

Today is a good day,pos
I hate this car.,neg

classmethod detect(stream)¶: Return True if stream is valid.

to_iterable()¶: Return an iterable object from the data.

class textblob.formats.DelimitedFormat(fp, **kwargs)[source]¶

A general character-delimited format.

classmethod detect(stream)[source]¶: Return True if stream is valid.

to_iterable()[source]¶: Return an iterable object from the data.
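What reading a character-delimited file of text and label rows amounts to can be sketched with the standard csv module. The helper name read_delimited and the pipe-delimited sample data are assumptions for illustration; this is not the library's implementation.

```python
import csv
import io


def read_delimited(fp, delimiter):
    """Return (text, label) tuples from a file-like object of delimited rows."""
    # Skip empty rows; take the first two fields of every other row.
    return [(row[0], row[1]) for row in csv.reader(fp, delimiter=delimiter) if row]


fp = io.StringIO("Today is a good day|pos\nI hate this car.|neg\n")
rows = read_delimited(fp, "|")
print(rows)  # [('Today is a good day', 'pos'), ('I hate this car.', 'neg')]
```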

class textblob.formats.JSON(fp, **kwargs)[source]¶

JSON format.

Assumes that JSON is formatted as an array of objects with text and label properties.

[
    {"text": "Today is a good day.", "label": "pos"},
    {"text": "I hate this car.", "label": "neg"}
]

classmethod detect(stream: str | bytes | bytearray)[source]¶: Return True if stream is valid JSON.

to_iterable()[source]¶: Return an iterable object from the JSON data.
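Turning an array of objects with text and label properties into training pairs can be sketched with the standard json module. The helper name read_json is hypothetical, and the tuple output is an assumption based on the (text, label) form that the classifiers accept.

```python
import io
import json


def read_json(fp):
    """Return (text, label) tuples from a file-like object holding a JSON array."""
    return [(item["text"], item["label"]) for item in json.load(fp)]


fp = io.StringIO(
    '[{"text": "Today is a good day.", "label": "pos"},'
    ' {"text": "I hate this car.", "label": "neg"}]'
)
pairs = read_json(fp)
print(pairs)  # [('Today is a good day.', 'pos'), ('I hate this car.', 'neg')]
```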

class textblob.formats.TSV(fp, **kwargs)[source]¶

TSV format. Assumes each row consists of the text and the label separated by a tab character.

classmethod detect(stream)¶: Return True if stream is valid.

to_iterable()¶: Return an iterable object from the data.

textblob.formats.detect(fp, max_read=1024)[source]¶: Attempt to detect a file’s format, trying each of the supported formats. Return the format class that was detected. If no format is detected, return None.

textblob.formats.get_registry()[source]¶: Return a dictionary of registered formats.

textblob.formats.register(name, format_class)[source]¶

Register a new format.

Parameters:

name (str) – The name that will be used to refer to the format, e.g. ‘csv’
format_class (type) – The format class to register.

Wordnet¶

Exceptions¶

exception textblob.exceptions.TextBlobError[source]¶: A TextBlob-related error.

exception textblob.exceptions.MissingCorpusError(message="\nLooks like you are missing some required data for this feature.\n\nTo download the necessary data, simply run\n\n python -m textblob.download_corpora\n\nor use the NLTK downloader to download the missing data: http://nltk.org/data.html\nIf this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.\n", *args, **kwargs)[source]¶: Exception thrown when a user tries to use a feature that requires a dataset or model that the user does not have on their system.

exception textblob.exceptions.DeprecationError[source]¶: Raised when user uses a deprecated feature.

exception textblob.exceptions.TranslatorError[source]¶: Raised when an error occurs during language translation or detection.

exception textblob.exceptions.NotTranslated[source]¶: Raised when text is unchanged after translation. This may be due to the language being unsupported by the translator.

exception textblob.exceptions.FormatError[source]¶: Raised if a data file with an unsupported format is passed to a classifier.
