You can count categorical data, but the text is still unstructured. We need a way to impose structure on the text, preferably in a way that is consistent with tidyverse principles so we can continue to use the functions we know and love.
The tidytext package does just that. Developed by Julia Silge and David Robinson, the tidytext package provides a suite of powerful tools that allow us to quickly and easily structure text and analyze it, taking full advantage of the tidyverse for text analysis.
We impose structure on text by splitting each review into separate words. In natural language processing or NLP circles, this is called a bag of words. We don't care about the syntax or structure of the reviews; we're simply cutting out each word in each review and mixing them up in a bag: a bag of words! Each separate body of text is a document; in this case, the reviews. Each unique word is known as a term. Every occurrence of a term is known as a token; thus, cutting up documents into words is known as tokenizing.
After loading the tidytext package, tokenizing is as simple as using the unnest_tokens() function. After specifying the input data frame, we provide the name of the column of words we're creating by tokenizing, followed by the name of the column with the text we want to tokenize. In review_data, that is the review column.
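As a sketch of that call, here is a toy review_data invented for illustration (the course's dataset has 1,833 reviews):

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the course's review_data (two reviews instead of 1,833)
review_data <- tibble(
  review = c("Great product, works well!",
             "Battery died after a week.")
)

# unnest_tokens(data, output, input): name the new word column first,
# then the text column to tokenize
tidy_review_data <- review_data %>%
  unnest_tokens(word, review)

tidy_review_data
```

The result has one row per token rather than one row per review.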
Instead of a column with a review in each row, we now have a column with a single word in each row. As a bonus, unnest_tokens() has done some cleaning for us: punctuation is gone, each word is lowercase, and white space has been removed. Having a single word per row means the total number of rows in the dataset has exploded from 1,833 to 229,481.
Now that we have imposed a tidy structure on the text, we can count words using the count() function. To make it easy to read the counts, we again use the arrange() verb, and the desc() helper function.
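Continuing the same pattern, with another invented two-review data frame, the counting step looks like:

```r
library(dplyr)
library(tidytext)

review_data <- tibble(
  review = c("great phone great battery", "battery died fast")
)

review_data %>%
  unnest_tokens(word, review) %>%
  count(word) %>%          # one row per distinct word, with its count in n
  arrange(desc(n))         # most frequent words first
```

As a shortcut, count(word, sort = TRUE) combines the counting and descending sort in one call.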
You shouldn't be surprised to see that the most frequent words are just common words like "the" that don't give much insight into the content of the reviews. We need to do some additional cleaning before our word counts will be informative.
These common and uninformative words are known as stop words and we’d like to remove them from our tidied data frame. A set of functions in dplyr comes in handy. These are known as joins, and as the name suggests, they are used to join two data frames together based on one or more matching columns. The join we want is called an anti_join(). In an anti_join(), a row in the data frame on the left is retained as long as the value in the matching column isn’t shared by the data frame on the right.
Let's illustrate this with review_data. A data frame of common stop words is available in the tidytext package as stop_words. If we pipe the tokenized review_data into anti_join(), so that the tokenized review data is the left data frame and stop_words is the right data frame, you can see that the number of rows has been drastically reduced.
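A sketch of that anti_join, again with invented reviews (stop_words ships with tidytext):

```r
library(dplyr)
library(tidytext)

review_data <- tibble(
  review = c("the screen is great",
             "the battery died after a week")
)

tidy_reviews <- review_data %>%
  unnest_tokens(word, review)

# Keep only the rows whose word does NOT appear in stop_words
informative <- tidy_reviews %>%
  anti_join(stop_words, by = "word")
```

Supplying by = "word" makes the matching column explicit; without it, dplyr joins on all column names the two data frames share.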
After again computing the counts and arranging them in descending order, we can now see that the most commonly used words in the product reviews reflect actual informative content.
We are standing on the shoulders of giants with the tidytext package, and this is just the beginning. Let’s give it a try!