Module 13: Texts¶
We'll use spaCy and wordcloud to play with text data. spaCy is probably the best Python package for analyzing text data. It's capable and super fast. Let's install them.
pip install wordcloud spacy
To use spaCy, you also need to download models. Run:
python -m spacy download en_core_web_sm
spaCy basics¶
import spacy
import wordcloud
nlp = spacy.load('en_core_web_sm')
Usually the first step of text analysis is tokenization, which is the process of breaking a document into "tokens". You can roughly think of it as extracting each word.
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token)
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
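For comparison, here is what a plain str.split() gives for the same sentence (a quick sketch, no spaCy involved):

print(u'Apple is looking at buying U.K. startup for $1 billion'.split())
# ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', 'billion']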
As you can see, it's not exactly the same as doc.split(). You'd want $ to be a separate token because it has a particular meaning (USD). Actually, as shown in an example (https://spacy.io/usage/spacy-101#annotations-pos-deps), spaCy figures out a lot of things about these tokens. For instance,
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
Apple Apple PROPN NNP
is be AUX VBZ
looking look VERB VBG
at at ADP IN
buying buy VERB VBG
U.K. U.K. PROPN NNP
startup startup NOUN NN
for for ADP IN
$ $ SYM $
1 1 NUM CD
billion billion NUM CD
It figured out that Apple is a proper noun ("PROPN" and "NNP"; see the spaCy documentation for the full list of part-of-speech tags).
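If a tag is unfamiliar, spacy.explain gives a short description. The comments below show roughly what it returns; the exact wording may vary across spaCy versions.

print(spacy.explain('PROPN'))  # e.g. 'proper noun'
print(spacy.explain('NNP'))    # e.g. 'noun, proper singular'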
spaCy has a visualizer too.
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})
It even recognizes entities and can visualize them.
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
[displaCy entity visualization: the passage is rendered with highlighted entity spans, including Apple ORG, Siri PERSON, iPhones ORG, Amazon ORG, Alexa ORG, and Echo LOC]
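The same entities are also available programmatically via doc2.ents; each entity span has a text and a label:

for ent in doc2.ents:
    print(ent.text, ent.label_)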
Let's read a book¶
Shall we load a serious book? You can use any book that you can find as a text file.
import urllib.request
book = urllib.request.urlopen('https://sherlock-holm.es/stories/plain-text/stud.txt').read()
book[:1000]
b'\n\n\n\n A STUDY IN SCARLET\n\n Arthur Conan Doyle\n\n\n\n\n\n\n\n Table of contents\n\n Part I\n Mr. Sherlock Holmes\n The Science Of Deduction\n The Lauriston Garden Mystery\n What John Rance Had To Tell\n Our Advertisement Brings A Visitor\n Tobias Gregson Shows What He Can Do\n Light In The Darkness\n\n Part II\n On The Great Alkali Plain\n The Flower Of Utah\n John Ferrier Talks With The Prophet\n A Flight For Life\n The Avenging Angels\n A Continuation Of The Reminiscences Of John Watson, M.D.\n The Conclusion\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n PART I\n\n (Being a reprint from the reminiscences of\n John H. Watson, M.D.,\n late of the Army Medical Department.)\n\n\n\n\n\n CHAPTER I\n Mr. Sherlock Holmes\n\n\n In t'
Looks like we have successfully loaded the book. If you were doing a serious analysis, you'd probably want to remove the parts at the beginning and the end that are not part of the book itself, but let's ignore them for now. Let's try to feed this directly into spaCy.
doc = nlp(book)
---------------------------------------------------------------------------
ExtraData                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 doc = nlp(book)

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/spacy/language.py:1037, in Language.__call__(self, text, disable, component_cfg)
   1016 def __call__(
   1017     self,
   1018     text: Union[str, Doc],
   (...)
   1021     component_cfg: Optional[Dict[str, Dict[str, Any]]] = None,
   1022 ) -> Doc:
   1023     """Apply the pipeline to some text. The text can span multiple sentences,
   1024     and can contain arbitrary whitespace. Alignment into the original string
   1025     is preserved.
   (...)
   1035     DOCS: https://spacy.io/api/language#call
   1036     """
-> 1037 doc = self._ensure_doc(text)
   1038 if component_cfg is None:
   1039     component_cfg = {}

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/spacy/language.py:1130, in Language._ensure_doc(self, doc_like)
   1128     return self.make_doc(doc_like)
   1129 if isinstance(doc_like, bytes):
-> 1130     return Doc(self.vocab).from_bytes(doc_like)
   1131 raise ValueError(Errors.E1041.format(type=type(doc_like)))

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/spacy/tokens/doc.pyx:1359, in spacy.tokens.doc.Doc.from_bytes()

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/srsly/_msgpack_api.py:27, in msgpack_loads(data, use_list)
     25 # msgpack-python docs suggest disabling gc before unpacking large messages
     26 gc.disable()
---> 27 msg = msgpack.loads(data, raw=False, use_list=use_list)
     28 gc.enable()
     29 return msg

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/srsly/msgpack/__init__.py:79, in unpackb(packed, **kwargs)
     77     object_hook = functools.partial(decoder, chain=object_hook)
     78     kwargs["object_hook"] = object_hook
---> 79 return _unpackb(packed, **kwargs)

File ~/git/dataviz-solutions/.venv/lib/python3.11/site-packages/srsly/msgpack/_unpacker.pyx:199, in srsly.msgpack._unpacker.unpackb()

ExtraData: unpack(b) received extra data.
On encodings¶
Why are we getting this error? What does it mean? The nlp function expects a str, but we passed bytes.
type(book)
bytes
Indeed, the type of book is bytes. But as we saw above, we can read the book's contents, right? What's going on?
Well, the problem is that a byte sequence is not yet a proper string until we know how to decode it. A string is an abstract object, and we need to specify an encoding to write that string into a file. For instance, if I have a string of Korean characters like "안녕", there are several encodings I could use to write it to a file, and depending on the encoding I choose, the byte sequences can be totally different from each other. This is a really important (and confusing) topic, but because it's beyond the scope of the course, I'll just link to a nice post about encodings: http://kunststube.net/encoding/
"안녕".encode('utf8')
b'\xec\x95\x88\xeb\x85\x95'
# b'\xec\x95\x88\xeb\x85\x95'.decode('euc-kr') <- what happens if you do this?
b'\xec\x95\x88\xeb\x85\x95'.decode('utf8')
'안녕'
"안녕".encode('euc-kr')
b'\xbe\xc8\xb3\xe7'
b'\xbe\xc8\xb3\xe7'.decode('euc-kr')
'안녕'
You can decode with the "wrong" encoding too.
b'\xbe\xc8\xb3\xe7'.decode('latin-1')
'¾È³ç'
As you can see, the same string can be encoded into different byte sequences depending on the encoding. It's a really annoying (but fun) topic, and if you need to deal with text data, you must have a good understanding of it.
There is a lot of complexity in encodings, but for now, just remember that utf-8 is the most common encoding. It is also backward compatible with ASCII, which means you can decode both ASCII and utf-8 documents with the utf-8 encoding. So let's decode the byte sequence into a string.
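To convince yourself of that ASCII/utf-8 compatibility first, here is a quick check; the example string is arbitrary:

ascii_bytes = 'Sherlock Holmes'.encode('ascii')
print(ascii_bytes.decode('utf8'))  # works, because ASCII byte sequences are valid utf-8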
# YOUR SOLUTION HERE
type(book_str)
str
Shall we try again?
doc = nlp(book_str)
# keep only the tokens that are not stop words and not punctuation
words = [token.text for token in doc
         if not token.is_stop and not token.is_punct]
Let's count!¶
from collections import Counter
Counter(words).most_common(5)
[('\n ', 3107), ('\n\n ', 772), ('said', 207), ('man', 155), ('Holmes', 98)]
There are a lot of newline characters and multiple spaces. A quick and dirty way to remove them is split & join: you split the document using split(), which splits on any run of whitespace, and then join the pieces back with a single space. Can you implement it and print the 10 most common words?
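If the split & join trick is unclear, here is what it does on a tiny made-up string:

messy = 'a  lot\n\n of   whitespace\n'
print(' '.join(messy.split()))  # 'a lot of whitespace'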
# YOUR SOLUTION HERE
[('said', 207), ('man', 155), ('Holmes', 98), ('little', 80), ('time', 77), ('way', 69), ('came', 67), ('face', 67), ('asked', 65), ('come', 64)]
Let's keep the word count object.
word_cnt = Counter(words)
Some wordclouds?¶
import matplotlib.pyplot as plt
Can you check out the wordcloud package documentation, create a word cloud from the word count object we built from the book above, and plot it?
# Implement: create a word cloud object
# YOUR SOLUTION HERE
<wordcloud.wordcloud.WordCloud at 0x363a479d0>
# Implement: plot the word cloud object
# YOUR SOLUTION HERE
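If you get stuck, here is one possible sketch for both steps, using WordCloud.generate_from_frequencies; the size, background, and interpolation settings are arbitrary choices:

# build a word cloud from the word frequencies we counted earlier
wc = wordcloud.WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(word_cnt)

# draw it with matplotlib and hide the axes
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()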