
Manipulating strings with the {stringr} package

 5 years ago
source link: https://www.tuicool.com/articles/hit/FfqmYz3

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


This blog post is an excerpt of my ebook Modern R with the tidyverse, which you can read for free here. It is taken from Chapter 4, in which I introduce the {stringr} package.

Manipulate strings with {stringr}

{stringr} contains functions to manipulate strings. In Chapter 10, I will teach you about regular

expressions, but the functions contained in {stringr} allow you to already do a lot of work on

strings, without needing to be a regular expression expert.

I will discuss the most common string operations: detecting, locating, matching, searching and replacing, and extracting/removing strings.

To introduce these operations, let us use an ALTO file of an issue of The Winchester News from October 31, 1910, which you can find on this link (to see what the newspaper looked like, click here). I re-hosted the file on a public gist for archiving purposes. While working on the book, the original site went down several times…

ALTO is an XML schema for the description of text OCR and layout information of pages for digitized material, such as newspapers (source: ALTO Wikipedia page).

For more details, you can read my

blogpost

on the matter, but for our current purposes, it is enough to know that the file contains the text

of newspaper articles. The file looks like this:

<textline height="138.0" width="450" hpos="4056.0" vpos="5814.0">

 <string stylerefs="ID7" height="108.0" width="393.0" hpos="4056.0" vpos="5838.0" content="timore" wc="0.82539684">

  <alternative>
   timole
  </alternative>

  <alternative>
   tlnldre
  </alternative>

  <alternative>
   timor
  </alternative>

  <alternative>
   insole
  </alternative>

  <alternative>
   landed
  </alternative>

 </string>

 <sp width="74.0" hpos="4449.0" vpos="5838.0" />

 <string stylerefs="ID7" height="105.0" width="432.0" hpos="4524.0" vpos="5847.0" content="market" wc="0.95238096" />

 <sp width="116.0" hpos="4956.0" vpos="5847.0" />

 <string stylerefs="ID7" height="69.0" width="138.0" hpos="5073.0" vpos="5883.0" content="as" wc="0.96825397" />

 <sp width="74.0" hpos="5211.0" vpos="5883.0" />

 <string stylerefs="ID7" height="69.0" width="285.0" hpos="5286.0" vpos="5877.0" content="were" wc="1.0">

  <alternative>
   verc
  </alternative>

  <alternative>
   veer
  </alternative>

 </string>

 <sp width="68.0" hpos="5571.0" vpos="5877.0" />

 <string stylerefs="ID7" height="111.0" width="147.0" hpos="5640.0" vpos="5838.0" content="all" wc="1.0" />

 <sp width="83.0" hpos="5787.0" vpos="5838.0" />

 <string stylerefs="ID7" height="111.0" width="183.0" hpos="5871.0" vpos="5835.0" content="the" wc="0.95238096">

  <alternative>
   tll
  </alternative>

  <alternative>
   Cu
  </alternative>

  <alternative>
   tall
  </alternative>

 </string>

 <sp width="75.0" hpos="6054.0" vpos="5835.0" />

 <string stylerefs="ID3" height="132.0" width="351.0" hpos="6129.0" vpos="5814.0" content="cattle" wc="0.95238096" />

</textline>

We are interested in the strings after CONTENT= (in the file itself, attribute names such as CONTENT and WC are written in uppercase), and we are going to use functions from the {stringr} package to get them. In Chapter 10, we are going to explore this file again, but using complex regular expressions to get all the content in one go.

Getting text data into RStudio

First of all, let us read in the file:

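library(tidyverse)  # loads {readr} (for read_lines()), {stringr}, {dplyr}, {tibble} and the %>% pipe

winchester <- read_lines("https://gist.githubusercontent.com/b-rodrigues/5139560e7d0f2ecebe5da1df3629e015/raw/e3031d894ffb97217ddbad1ade1b307c9937d2c8/gistfile1.txt")
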
Even though the file is an XML file, I still read it in using read_lines() and not read_xml()

from the {xml2} package. This is for the purposes of the current exercise, and also because I

always have trouble with XML files, and prefer to treat them as simple text files, and use regular

expressions to get what I need.

Now that the ALTO file is read in and saved in the winchester variable, you might want to print

the whole thing in the console. Before that, take a look at the structure:

str(winchester)
##  chr [1:43] "" ...

So the winchester variable is a character atomic vector with 43 elements. So first, we need to

understand what these elements are. Let’s start with the first one:

winchester[1]
## [1] ""

Ok, so it seems like the first element is part of the header of the file. What about the second one?

winchester[2]
## [1] "

This is Google's cache of

To find out which elements of winchester contain the string CONTENT, we can use str_detect():

winchester %>%
  str_detect("CONTENT")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

This returns a boolean atomic vector of the same length as winchester. If the string CONTENT is nowhere to be found in an element, the result will equal FALSE for that element; if not, it will equal TRUE. Here it is easy to see that the last element contains the string CONTENT. But what if instead of having 43 elements, the vector had 24192 elements, and hundreds of them contained the string CONTENT? It would be easier to instead have the indices of the vector where one can find the word CONTENT. This is possible with str_which():

winchester %>%
  str_which("CONTENT")
## [1] 43
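
As a quick check of how these functions relate to each other, here is a minimal sketch on a small made-up vector (toy_lines is hypothetical and not part of the newspaper file). It suggests that str_which() returns the same indices as base R's which() applied to the result of str_detect(), and that str_subset() returns the matching elements themselves:

```r
library(stringr)

# Hypothetical toy vector standing in for the lines of a file
toy_lines <- c("<header>", "CONTENT=\"a\"", "<sp/>", "CONTENT=\"b\"")

# Logical vector: one TRUE/FALSE per element
detected <- str_detect(toy_lines, "CONTENT")

# Integer indices of the elements that contain the pattern
positions <- str_which(toy_lines, "CONTENT")

# Both routes lead to the same indices
identical(positions, which(detected))

# str_subset() returns the matching elements directly
matching <- str_subset(toy_lines, "CONTENT")
identical(matching, toy_lines[positions])
```

Which one to use depends on what comes next: a logical vector is handy inside filter(), while indices are handy for subsetting with [ ].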

To see how locating strings works, let us create a small example vector:

ancient_philosophers <- c("aristotle", "plato", "epictetus", "seneca the younger", "epicurus", "marcus aurelius")

Now suppose I am interested in philosophers whose name ends in us . Let us use str_locate() first:

ancient_philosophers %>%
  str_locate("us")
##      start end
## [1,]    NA  NA
## [2,]    NA  NA
## [3,]     8   9
## [4,]    NA  NA
## [5,]     7   8
## [6,]     5   6

You can interpret the result as follows: each row corresponds to an element of the vector, and the rows without NA are the ones where the string us is found. So the 3rd, 5th and 6th philosophers have us somewhere in their name. The result also has two columns: start and end. These give the position of the string. So the string us can be found starting at position 8 of the 3rd element of the vector, and ends at position 9. The same goes for the other philosophers. However, consider Marcus Aurelius. He has two names, both ending with us, but str_locate() only shows the position of the us in Marcus.

To get both us strings, you need to use str_locate_all() :

ancient_philosophers %>%
  str_locate_all("us")
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     8   9
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## [1,]     7   8
## 
## [[6]]
##      start end
## [1,]     5   6
## [2,]    14  15

Now we get the position of the two us in Marcus Aurelius. Doing this on the winchester vector will give us the position of the CONTENT string, but this is not really important right now. What matters is that you know how str_locate() and str_locate_all() work.
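
To make the link between positions and the strings themselves more concrete, here is a small sketch that reuses the ancient_philosophers vector. The use of str_sub() and sapply() is my own addition for illustration; the positions returned by str_locate() are fed to str_sub() to pull out the located substring, and the number of rows of each matrix returned by str_locate_all() gives the number of matches:

```r
library(stringr)

ancient_philosophers <- c("aristotle", "plato", "epictetus",
                          "seneca the younger", "epicurus", "marcus aurelius")

# Position of the first "us" in each name (NA where there is none)
first_us <- str_locate(ancient_philosophers, "us")

# Use the start and end columns to extract the located substring
located <- str_sub(ancient_philosophers, first_us[, "start"], first_us[, "end"])

# One matrix per name; its number of rows is the number of matches
n_us <- sapply(str_locate_all(ancient_philosophers, "us"), nrow)

# str_count() gives the same counts directly
identical(n_us, str_count(ancient_philosophers, "us"))
```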

So now that we know what interests us in the 43rd element of winchester, let’s take a closer look at it:

winchester[43]

As you can see, it’s a mess:

<textline height="\"126.0\"" width="\"1731.0\"" hpos="\"17160.0\"" vpos="\"21252.0\"">
 <string height="\"114.0\"" width="\"354.0\"" hpos="\"17160.0\"" vpos="\"21264.0\"" content="\"0tV\"" wc="\"0.8095238\"" />
 <sp width="\"131.0\"" hpos="\"17514.0\"" vpos="\"21264.0\"" />
 <string stylerefs="\"ID7\"" height="\"111.0\"" width="\"474.0\"" hpos="\"17646.0\"" vpos="\"21258.0\"" content="\"BATES\"" wc="\"1.0\"" />
 <sp width="\"140.0\"" hpos="\"18120.0\"" vpos="\"21258.0\"" />
 <string stylerefs="\"ID7\"" height="\"114.0\"" width="\"630.0\"" hpos="\"18261.0\"" vpos="\"21252.0\"" content="\"President\"" wc="\"1.0\"">
  <alternative>
   Prcideht
  </alternative>
  <alternative>
   Pride
  </alternative>
 </string>
</textline>
<textline height="\"153.0\"" width="\"1689.0\"" hpos="\"17145.0\"" vpos="\"21417.0\"">
 <string stylerefs="\"ID7\"" height="\"105.0\"" width="\"258.0\"" hpos="\"17145.0\"" vpos="\"21439.0\"" content="\"WM\"" wc="\"0.82539684\"">
  <textline height="\"120.0\"" width="\"2211.0\"" hpos="\"16788.0\"" vpos="\"21870.0\"">
   <string stylerefs="\"ID7\"" height="\"96.0\"" width="\"102.0\"" hpos="\"16788.0\"" vpos="\"21894.0\"" content="\"It\"" wc="\"1.0\"" />
   <sp width="\"72.0\"" hpos="\"16890.0\"" vpos="\"21894.0\"" />
   <string stylerefs="\"ID7\"" height="\"96.0\"" width="\"93.0\"" hpos="\"16962.0\"" vpos="\"21885.0\"" content="\"is\"" wc="\"1.0\"" />
   <sp width="\"80.0\"" hpos="\"17055.0\"" vpos="\"21885.0\"" />
   <string stylerefs="\"ID7\"" height="\"102.0\"" width="\"417.0\"" hpos="\"17136.0\"" vpos="\"21879.0\"" content="\"seldom\"" wc="\"1.0\"" />
   <sp width="\"80.0\"" hpos="\"17553.0\"" vpos="\"21879.0\"" />
   <string stylerefs="\"ID7\"" height="\"96.0\"" width="\"267.0\"" hpos="\"17634.0\"" vpos="\"21873.0\"" content="\"hard\"" wc="\"1.0\"" />
   <sp width="\"81.0\"" hpos="\"17901.0\"" vpos="\"21873.0\"" />
   <string stylerefs="\"ID7\"" height="\"87.0\"" width="\"111.0\"" hpos="\"17982.0\"" vpos="\"21879.0\"" content="\"to\"" wc="\"1.0\"" />
   <sp width="\"81.0\"" hpos="\"18093.0\"" vpos="\"21879.0\"" />
   <string stylerefs="\"ID7\"" height="\"96.0\"" width="\"219.0\"" hpos="\"18174.0\"" vpos="\"21870.0\"" content="\"find\"" wc="\"1.0\"" />
   <sp width="\"77.0\"" hpos="\"18393.0\"" vpos="\"21870.0\"" />
   <string stylerefs="\"ID7\"" height="\"69.0\"" width="\"66.0\"" hpos="\"18471.0\"" vpos="\"21894.0\"" content="\"a\"" wc="\"1.0\"" />
   <sp width="\"77.0\"" hpos="\"18537.0\"" vpos="\"21894.0\"" />
   <string stylerefs="\"ID7\"" height="\"78.0\"" width="\"384.0\"" hpos="\"18615.0\"" vpos="\"21888.0\"" content="\"succes\"" wc="\"0.82539684\"">
    <alternative>
     success
    </alternative>
   </string>
  </textline>
  <textline height="\"126.0\"" width="\"2316.0\"" hpos="\"16662.0\"" vpos="\"22008.0\"">
   <string stylerefs="\"ID7\"" height="\"75.0\"" width="\"183.0\"" hpos="\"16662.0\"" vpos="\"22059.0\"" content="\"sor\"" wc="\"1.0\"">
    <alternative>
     soar
    </alternative>
   </string>
   <sp width="\"72.0\"" hpos="\"16845.0\"" vpos="\"22059.0\"" />
   <string stylerefs="\"ID7\"" height="\"90.0\"" width="\"168.0\"" hpos="\"16917.0\"" vpos="\"22035.0\"" content="\"for\"" wc="\"1.0\"" />
   <sp width="\"72.0\"" hpos="\"17085.0\"" vpos="\"22035.0\"" />
   <string stylerefs="\"ID7\"" height="\"69.0\"" width="\"267.0\"" hpos="\"17157.0\"" vpos="\"22050.0\"" content="\"even\"" wc="\"1.0\"">
    <alternative>
     cen
    </alternative>
    <alternative>
     cent
    </alternative>
   </string>
   <sp width="\"77.0\"" hpos="\"17434.0\"" vpos="\"22050.0\"" />
   <string stylerefs="\"ID7\"" height="\"66.0\"" width="\"63.0\"" hpos="\"17502.0\"" vpos="\"22044.0\""></string>
  </textline>
 </string>
</textline>

The file was imported without any newlines. So we need to insert them ourselves, by splitting the

string in a clever way.

Splitting strings

There are two functions included in {stringr} to split strings, str_split() and str_split_fixed() .

Let’s go back to our ancient philosophers. Two of them, Seneca the Younger and Marcus Aurelius, have something in common other than both being Roman Stoic philosophers: their names are composed of several words. If we want to split their names at the space character, we can use str_split() like this:

ancient_philosophers %>%
  str_split(" ")
## [[1]]
## [1] "aristotle"
## 
## [[2]]
## [1] "plato"
## 
## [[3]]
## [1] "epictetus"
## 
## [[4]]
## [1] "seneca"  "the"     "younger"
## 
## [[5]]
## [1] "epicurus"
## 
## [[6]]
## [1] "marcus"   "aurelius"

str_split() also has a simplify = TRUE option:

ancient_philosophers %>%
  str_split(" ", simplify = TRUE)
##      [,1]        [,2]       [,3]     
## [1,] "aristotle" ""         ""       
## [2,] "plato"     ""         ""       
## [3,] "epictetus" ""         ""       
## [4,] "seneca"    "the"      "younger"
## [5,] "epicurus"  ""         ""       
## [6,] "marcus"    "aurelius" ""

This time, the returned object is a matrix.

What about str_split_fixed() ? The difference is that here you can specify the number of pieces

to return. For example, you could consider the name “Aurelius” to be the middle name of Marcus Aurelius,

and the “the younger” to be the middle name of Seneca the younger. This means that you would want

to split the name only at the first space character, and not at all of them. This is easily achieved

with str_split_fixed() :

ancient_philosophers %>%
  str_split_fixed(" ", 2)
##      [,1]        [,2]         
## [1,] "aristotle" ""           
## [2,] "plato"     ""           
## [3,] "epictetus" ""           
## [4,] "seneca"    "the younger"
## [5,] "epicurus"  ""           
## [6,] "marcus"    "aurelius"

This gives the expected result.

So how does this help in our case? Well, if you look at what the ALTO file looks like, at the beginning of this section, you will notice that every line ends with the “>” character. So let’s split at that character!

winchester_text <- winchester[43] %>%
  str_split(">")
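
Before looking at the real result, it may help to see what this kind of split does on a much smaller string. The sketch below uses a hypothetical one-line XML snippet (toy_xml is made up for illustration): the “>” delimiter is dropped from every piece, and because the string ends with “>”, the last piece is an empty string:

```r
library(stringr)

# Hypothetical one-line XML string, much shorter than winchester[43]
toy_xml <- "<TextLine><String CONTENT=\"news\"/></TextLine>"

pieces <- str_split(toy_xml, ">")

# str_split() returns a list with one element per input string
class(pieces)
length(pieces)

# The pieces themselves: the ">" is gone, and the last piece is ""
pieces[[1]]
```

This is also why self-closing tags end with “/” instead of “/>” once the string has been split.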

Let’s take a closer look at winchester_text :

str(winchester_text)
## List of 1
##  $ : chr [1:19706] "

So this is a list of length one, and the first, and only, element of that list is an atomic vector

with 19706 elements. Since this is a list of only one element, we can simplify it by saving the

atomic vector in a variable:

winchester_text <- winchester_text[[1]]

Let’s now look at some lines:

winchester_text[1232:1245]
##  [1] "
<sp width="\"66.0\"" hpos="\"5763.0\"" vpos="\"9696.0\"/"" 2="">
 <string stylerefs="\"ID7\"" height="\"108.0\"" width="\"612.0\"" hpos="\"5829.0\"" vpos="\"9693.0\"" content="\"Louisville\"" wc="\"1.0\""" 3="">
  <alternative 4="" loniile=""></alternative>
 </string>
</sp>

This now looks easier to handle. We can narrow it down to only the lines that contain the string we are interested in, “CONTENT”. First, let’s get the indices:

content_winchester_index <- winchester_text %>%
  str_which("CONTENT")

How many lines contain the string “CONTENT”?

length(content_winchester_index)
## [1] 4462

As you can see, this reduces the amount of data we have to work with. Let us save this in a new variable:

content_winchester <- winchester_text[content_winchester_index]

Matching strings

Matching strings is useful, but only in combination with regular expressions. As stated at the

beginning of this section, we are going to learn about regular expressions in Chapter 10, but in

order to make this section useful, we are going to learn the easiest, but perhaps the most useful

regular expression: .* .

Let’s go back to our ancient philosophers, and use str_match() and see what happens. Let’s match

the “us” string:

ancient_philosophers %>%
  str_match("us")
##      [,1]
## [1,] NA  
## [2,] NA  
## [3,] "us"
## [4,] NA  
## [5,] "us"
## [6,] "us"

Not very useful, but what about the regular expression .* ? How could it help?

ancient_philosophers %>%
  str_match(".*us")
##      [,1]             
## [1,] NA               
## [2,] NA               
## [3,] "epictetus"      
## [4,] NA               
## [5,] "epicurus"       
## [6,] "marcus aurelius"

That’s already very interesting! So how does .* work? To understand, let’s first start by using

. alone:

ancient_philosophers %>%
  str_match(".us")
##      [,1] 
## [1,] NA   
## [2,] NA   
## [3,] "tus"
## [4,] NA   
## [5,] "rus"
## [6,] "cus"

This also matched whatever symbol comes just before the “u” from “us”. What if we use two . instead?

ancient_philosophers %>%
  str_match("..us")
##      [,1]  
## [1,] NA    
## [2,] NA    
## [3,] "etus"
## [4,] NA    
## [5,] "urus"
## [6,] "rcus"

This time, we get the two symbols that immediately precede “us”. Instead of continuing like this, we now use *, which matches zero or more repetitions of whatever precedes it, here the dot. So by combining . and *, we can match any symbol repeatedly, until there is nothing more to match. Note that there is also +, which works similarly to *, but matches one or more repetitions.
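
The difference between * and + only shows up when there is nothing in front of the string you are looking for. Here is a minimal sketch with two made-up strings (the vector x is hypothetical):

```r
library(stringr)

# Two hypothetical strings: one is exactly "us", the other has characters before it
x <- c("us", "plus")

# ".*us": zero or more characters, then "us", so both strings match
star <- str_match(x, ".*us")

# ".+us": at least one character, then "us", so "us" on its own does not match
plus <- str_match(x, ".+us")

star
plus
```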

There is also a str_match_all() :

ancient_philosophers %>%
  str_match_all(".*us")
## [[1]]
##      [,1]
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]       
## [1,] "epictetus"
## 
## [[4]]
##      [,1]
## 
## [[5]]
##      [,1]      
## [1,] "epicurus"
## 
## [[6]]
##      [,1]             
## [1,] "marcus aurelius"

In this particular case it does not change the end result, but keep it in mind for cases like this one:

c("haha", "huhu") %>%
  str_match("ha")
##      [,1]
## [1,] "ha"
## [2,] NA

and:

c("haha", "huhu") %>%
  str_match_all("ha")
## [[1]]
##      [,1]
## [1,] "ha"
## [2,] "ha"
## 
## [[2]]
##      [,1]

What if we want to match names containing the letter “t”? Easy:

ancient_philosophers %>%
  str_match(".*t.*")
##      [,1]                
## [1,] "aristotle"         
## [2,] "plato"             
## [3,] "epictetus"         
## [4,] "seneca the younger"
## [5,] NA                  
## [6,] NA

So how does this help us with our historical newspaper? Let’s try to get the strings that come

after “CONTENT”:

winchester_content <- winchester_text %>%
  str_match("CONTENT.*")

Let’s use our faithful str() function to take a look:

winchester_content %>%
  str
##  chr [1:19706, 1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...

Hmm, there are a lot of NA values! This is because a lot of the lines from the file did not have the string “CONTENT”, so there is no match possible. Let us remove all these NAs. Because the result is a matrix, we cannot use the filter() function from {dplyr}. So we need to convert it to a tibble first:

winchester_content <- winchester_content %>%
  as_tibble(.name_repair = ~ "V1") %>%  # name the single, unnamed matrix column "V1"
  filter(!is.na(V1))

Because matrix columns do not have names, the column has to be given one when the matrix gets converted into a tibble. Older versions of {tibble} automatically called the first column V1 (as.tibble() has since been deprecated in favor of as_tibble()), so I keep that name here. This is why I filter on this column. Let’s take a look at the data:

head(winchester_content)
## # A tibble: 6 x 1
##   V1                                  
##   <chr>
## 1 "CONTENT=\"J\" WC=\"0.8095238\"/"
## 2 "CONTENT=\"a\" WC=\"0.8095238\"/"
## 3 "CONTENT=\"Ira\" WC=\"0.95238096\"/"
## 4 "CONTENT=\"mj\" WC=\"0.8095238\"/"
## 5 "CONTENT=\"iI\" WC=\"0.8095238\"/"
## 6 "CONTENT=\"tE1r\" WC=\"0.8095238\"/"
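
If you would rather not go through a tibble at this stage, the same NA removal can be sketched with plain matrix indexing. This is only an alternative illustration on made-up lines (toy_lines is hypothetical), not what the rest of this section uses:

```r
library(stringr)

# Hypothetical toy lines; only the second one contains "CONTENT"
toy_lines <- c("<sp WIDTH=\"66.0\"/",
               "<String CONTENT=\"news\" WC=\"1.0\"/",
               "</TextLine")

# str_match() returns a character matrix with one row per input string
m <- str_match(toy_lines, "CONTENT.*")
dim(m)

# Keep the entries of the first column that are not NA
kept <- m[!is.na(m[, 1]), 1]
kept
```

The tibble route is more convenient here because the following steps rely on mutate() and filter().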

Searching and replacing strings

We are getting close to the final result. We still need to do some cleaning however. Since our data

is inside a nice tibble, we might as well stick with it. So let’s first rename the column and

change all the strings to lowercase:

winchester_content <- winchester_content %>% 
  mutate(content = tolower(V1)) %>% 
  select(-V1)

Let’s take a look at the result:

head(winchester_content)
## # A tibble: 6 x 1
##   content                             
##   <chr>
## 1 "content=\"j\" wc=\"0.8095238\"/"
## 2 "content=\"a\" wc=\"0.8095238\"/"
## 3 "content=\"ira\" wc=\"0.95238096\"/"
## 4 "content=\"mj\" wc=\"0.8095238\"/"
## 5 "content=\"ii\" wc=\"0.8095238\"/"
## 6 "content=\"te1r\" wc=\"0.8095238\"/"

The second part of the string, “wc=….” is not really interesting. Let’s search and replace this

with an empty string, using str_replace() :

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "wc.*", ""))

head(winchester_content)
## # A tibble: 6 x 1
##   content            
##   <chr>
## 1 "content=\"j\" "
## 2 "content=\"a\" "
## 3 "content=\"ira\" "
## 4 "content=\"mj\" "
## 5 "content=\"ii\" "
## 6 "content=\"te1r\" "

Here we used the regular expression from before, .*, to replace “wc” and every character that follows. The same function can be used to remove “content=”:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "content=", ""))

head(winchester_content)
## # A tibble: 6 x 1
##   content    
##   <chr>
## 1 "\"j\" "
## 2 "\"a\" "
## 3 "\"ira\" "
## 4 "\"mj\" "
## 5 "\"ii\" "
## 6 "\"te1r\" "
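
Replacing a match with an empty string is common enough that {stringr} also offers str_remove(), which does the same thing with one argument less. A minimal sketch on two made-up strings that mimic the lines above (the vector x is hypothetical):

```r
library(stringr)

# Hypothetical strings shaped like the lowercased lines above
x <- c("content=\"j\" wc=\"0.8095238\"/",
       "content=\"ira\" wc=\"0.95238096\"/")

# Replacing the match with an empty string...
via_replace <- str_replace(x, "wc.*", "")

# ...gives the same result as removing the match
via_remove <- str_remove(x, "wc.*")

identical(via_replace, via_remove)
via_remove
```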

We are almost done, but some cleaning is still necessary:

Extracting or removing strings

Now, because I know the ALTO spec, I know how to find words that are split between two lines:

winchester_content %>% 
  filter(str_detect(content, "hyppart"))
## # A tibble: 64 x 1
##    content                                                               
##    <chr>
##  1 "\"aver\" subs_type=\"hyppart1\" subs_content=\"average\" "
##  2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "
##  3 "\"considera\" subs_type=\"hyppart1\" subs_content=\"consideration\" "
##  4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "
##  5 "\"re\" subs_type=\"hyppart1\" subs_content=\"resigned\" "
##  6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "
##  7 "\"install\" subs_type=\"hyppart1\" subs_content=\"installed\" "
##  8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "
##  9 "\"be\" subs_type=\"hyppart1\" subs_content=\"before\" "
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "
## # … with 54 more rows

For instance, the word “average” was split over two lines, the first part of the word, “aver” on the

first line, and the second part of the word, “age”, on the second line. We want to keep what comes

after “subs_content”. Let’s extract the word “average” using str_extract() . However, because only

some words were split between two lines, we first need to detect where the string “hyppart1” is

located, and only then can we extract what comes after “subs_content”. Thus, we need to combine

str_detect() to first detect the string, and then str_extract() to extract what comes after

“subs_content”:

winchester_content <- winchester_content %>% 
  mutate(content = if_else(str_detect(content, "hyppart1"), 
                           str_extract(content, "content=.*"),  # text from "content=" to the end of the string
                           content))

Let’s take a look at the result:

winchester_content %>% 
  filter(str_detect(content, "content"))
## # A tibble: 64 x 1
##    content                                                          
##    <chr>
##  1 "content=\"average\" "
##  2 "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" "
##  3 "content=\"consideration\" "
##  4 "\"tion\" subs_type=\"hyppart2\" subs_content=\"consideration\" "
##  5 "content=\"resigned\" "
##  6 "\"signed\" subs_type=\"hyppart2\" subs_content=\"resigned\" "
##  7 "content=\"installed\" "
##  8 "\"ed\" subs_type=\"hyppart2\" subs_content=\"installed\" "
##  9 "content=\"before\" "
## 10 "\"fore\" subs_type=\"hyppart2\" subs_content=\"before\" "
## # … with 54 more rows

We still need to get rid of the string “content=” and then of all the strings that contain “hyppart2”,

which are not needed now:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace(content, "content=", "")) %>% 
  mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content))

head(winchester_content)
## # A tibble: 6 x 1
##   content    
##   <chr>
## 1 "\"j\" "
## 2 "\"a\" "
## 3 "\"ira\" "
## 4 "\"mj\" "
## 5 "\"ii\" "
## 6 "\"te1r\" "
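
Since the first rows of the data contain no hyphenated words, the effect of these steps is not visible in the output above. Here is a minimal sketch of the whole hyphenation step on three made-up rows (the toy tibble is hypothetical and only mimics the shape of the real data; I use str_extract() for the extraction):

```r
library(dplyr)
library(stringr)

# Hypothetical rows: first half of a split word, second half, and a regular word
toy <- tibble(content = c(
  "\"aver\" subs_type=\"hyppart1\" subs_content=\"average\" ",
  "\"age\" subs_type=\"hyppart2\" subs_content=\"average\" ",
  "\"market\" "
))

result <- toy %>%
  # For the first half, keep what starts at "content=" (the full word)
  mutate(content = if_else(str_detect(content, "hyppart1"),
                           str_extract(content, "content=.*"),
                           content)) %>%
  # Drop the "content=" prefix
  mutate(content = str_replace(content, "content=", "")) %>%
  # The second half is redundant, so it is set to NA
  mutate(content = if_else(str_detect(content, "hyppart2"), NA_character_, content))

result$content
```

The rows set to NA are not dropped at this point; in this section they only disappear once the final filter() on the number of characters is applied.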

Almost done! We only need to remove the " characters:

winchester_content <- winchester_content %>% 
  mutate(content = str_replace_all(content, "\"", "")) 

head(winchester_content)
## # A tibble: 6 x 1
##   content
##   <chr>
## 1 "j "
## 2 "a "
## 3 "ira "
## 4 "mj "
## 5 "ii "
## 6 "te1r "
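
Note that I used str_replace_all() here and not str_replace(). The sketch below, on a single made-up string (x is hypothetical), shows why: str_replace() only replaces the first match, while str_replace_all() replaces every match:

```r
library(stringr)

# Hypothetical string with two double quotes and a trailing space
x <- "\"ira\" "

# Only the first double quote is removed
first_only <- str_replace(x, "\"", "")

# Both double quotes are removed
all_quotes <- str_replace_all(x, "\"", "")

first_only
all_quotes
```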

Let’s remove the leading and trailing space characters with str_trim():

winchester_content <- winchester_content %>% 
  mutate(content = str_trim(content)) 

head(winchester_content)
## # A tibble: 6 x 1
##   content
##   <chr>
## 1 j
## 2 a
## 3 ira
## 4 mj
## 5 ii
## 6 te1r
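
str_trim() only touches the beginning and the end of a string. As a small illustration on a made-up string (x is hypothetical), the side argument restricts trimming to one side, and str_squish() additionally collapses repeated spaces inside the string:

```r
library(stringr)

# Hypothetical string with leading, trailing and repeated inner spaces
x <- "  winchester   news "

# Both ends trimmed, inner spaces untouched
both_sides <- str_trim(x)

# Only the left side trimmed
left_only <- str_trim(x, side = "left")

# Ends trimmed and inner runs of spaces reduced to a single space
squished <- str_squish(x)
```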

To finish off this section, let’s remove stop words (words that do not add any meaning to a sentence, such as “as”, “and”…) and words that are composed of 3 characters or fewer. You can find a dataset with stop words inside the {stopwords} package:

library(stopwords)

data(data_stopwords_stopwordsiso)

eng_stopwords <- tibble("content" = data_stopwords_stopwordsiso$en)

winchester_content <- winchester_content %>% 
  anti_join(eng_stopwords) %>% 
  filter(nchar(content) > 3)
## Joining, by = "content"

head(winchester_content)
## # A tibble: 6 x 1
##   content
##   <chr>
## 1 te1r
## 2 jilas
## 3 edition
## 4 winchester
## 5 news
## 6 injuries
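
To see what these two steps do without needing the {stopwords} package, here is a minimal sketch on made-up data (both words and toy_stopwords are hypothetical). It also passes by = "content" explicitly, which avoids the “Joining, by” message:

```r
library(dplyr)

# Hypothetical words and a hypothetical, very short stop word list
words <- tibble(content = c("the", "news", "and", "cattle", "ira"))
toy_stopwords <- tibble(content = c("the", "and", "as"))

cleaned <- words %>%
  anti_join(toy_stopwords, by = "content") %>%  # drop rows found in the stop word list
  filter(nchar(content) > 3)                    # keep words of 4 characters or more

cleaned$content
```

Here “the” and “and” are removed by the anti_join(), and “ira” is removed by the filter() because it only has 3 characters.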

That’s it for this section! You now know how to work with strings, but in Chapter 10 we are going

one step further by learning about regular expressions, which offer much more power.

Hope you enjoyed! If you found this blog post useful, you might want to follow

me on twitter for blog post updates and

buy me an espresso or paypal.me .
