pacman::p_load(tidyverse, readtext, quanteda, tidytext)
In-class Exercise 5
Exploring Mini Challenge 1
VAST Challenge 2024
Packages:
quanteda - Quantitative Analysis of Textual Data
readtext - Reads text files in various formats
tidytext - Text mining with tidy data principles
Reading the data: the "/*" wildcard matches everything inside the data/articles folder, so ALL the article files are read in one call.
text_data <- readtext(paste0("data/articles", "/*"))
view(text_data)
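As a small illustration of the wildcard read above, the sketch below writes two toy text files to a temporary folder and reads them back with readtext. The folder name, file names and sentences are made up for the example and are not part of the challenge data.

```r
library(readtext)

# Hypothetical toy folder standing in for data/articles
toy_dir <- file.path(tempdir(), "toy_articles")
dir.create(toy_dir, showWarnings = FALSE)

# Two made-up articles
writeLines("Fishing company expands.", file.path(toy_dir, "a.txt"))
writeLines("Sustainable practices reviewed.", file.path(toy_dir, "b.txt"))

# "/*" matches every file in the folder, so both are read
toy_data <- readtext(paste0(toy_dir, "/*"))

# One row per file, with the file name in doc_id and the content in text
toy_data
```

Each file becomes one row, which is why the later steps can treat doc_id as an identifier for the article.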
Tokenising the article to identify key words (mainly nouns, excluding stop words)
usenet_words <- text_data %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
view(usenet_words)
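To see what the two filter conditions do, here is a minimal sketch on a made-up two-row table (the sentences are invented, not taken from the articles). The regex keeps tokens that end in a lowercase letter or an apostrophe, so purely numeric tokens drop out, and the second condition removes the stop words shipped with tidytext.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical toy input with the same columns readtext produces
toy <- tibble(
  doc_id = c("a.txt", "b.txt"),
  text   = c("The company is fishing in 2024.",
             "Sustainable fishing practices!")
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%              # one lowercase token per row
  filter(str_detect(word, "[a-z']$"),        # drops tokens such as "2024"
         !word %in% stop_words$word)         # drops "the", "is", "in"

toy_words
```

unnest_tokens lowercases the text and strips punctuation by default, which is why the regex only needs to check for lowercase letters.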
Counting the most frequent words after tokenising
Consider: Stem to get the root form of the word before counting
usenet_words %>%
  count(word, sort = TRUE)
readtext object consisting of 3260 documents and 0 docvars.
# A data frame: 3,260 × 3
word n text
<chr> <int> <chr>
1 fishing 2177 "\"\"..."
2 sustainable 1525 "\"\"..."
3 company 1036 "\"\"..."
4 practices 838 "\"\"..."
5 industry 715 "\"\"..."
6 transactions 696 "\"\"..."
# ℹ 3,254 more rows
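The stemming idea noted above could be sketched as follows. This assumes the SnowballC package, which is not among the packages loaded in this exercise, and uses a made-up word list; it is one possible approach rather than the method used in class.

```r
library(dplyr)
library(SnowballC)   # assumption: not loaded in the original exercise

# Hypothetical tokens, as they might look after unnest_tokens()
toy_words <- tibble(
  word = c("fishing", "fished", "fishes", "practice", "practices")
)

stem_counts <- toy_words %>%
  mutate(stem = wordStem(word, language = "english")) %>%  # reduce to root form
  count(stem, sort = TRUE)                                 # count roots, not surface forms

stem_counts
```

Counting on the stem groups variants such as "fishing" and "fished" under one root, so the ranking reflects concepts rather than inflections. The stems are not always dictionary words, so they are better suited to counting than to display.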
Breaking down the text data with tidyr
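- Splitting a column on a delimiter with separate_wider_delim (the delimiter is treated as a fixed string by default; separate_wider_regex is the regex variant)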
(Link: https://tidyr.tidyverse.org/reference/separate_wider_delim.html)
text_data_splitted <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")
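The effect of too_few = "align_end" is easiest to see on a toy table. The doc_id values below are hypothetical and only mimic the "__0__" pattern; one of them deliberately has no delimiter.

```r
library(tibble)
library(tidyr)

# Hypothetical file names: the second one has no "__0__" delimiter
toy <- tibble(
  doc_id = c("companyA__0__newsB.txt", "orphan.txt"),
  text   = c("first article", "second article")
)

toy_split <- toy %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")  # fill from the right, pad X with NA

toy_split
```

With "align_end", a value that yields only one piece is placed in the last column (Y) and the first column (X) is left as NA, instead of the call stopping with an error.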
References for text handling:
- Text mining with R - tidytext: https://www.tidytextmining.com/
- Using stringr to split text: https://stringr.tidyverse.org/
- Using tidyr to delimit text files: https://tidyr.tidyverse.org/
Handling Network Data
Loading json package
pacman::p_load(jsonlite, tidyverse, tidyr)
mc1_data <- fromJSON("data/mc1.json")
Data model: Multiple knowledge graph - nodes and links (each is already a data frame after fromJSON)
Clicking into nodes and links in the viewer shows the underlying data.
# Viewing the underlying data in nodes and links
view(mc1_data[["nodes"]])
view(mc1_data[["links"]])
mc2_data <- fromJSON("data/mc2.json")
view(mc2_data[["nodes"]])
view(mc2_data[["links"]])
# Exporting for analysis
#write_csv( mc1_data[["nodes"]], "mc1_nodes.csv")
#write_csv(mc1_data[["links"]], "mc1_link.csv")
#write_csv(mc2_data[["nodes"]], "mc2_nodes.csv")
#write_csv(mc2_data[["links"]], "mc2_links.csv")
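To illustrate why nodes and links arrive as data frames, the sketch below parses a tiny made-up JSON string with the same two top-level keys. The field names (id, type, source, target) and values are assumptions for the example and may not match the actual mc1.json or mc2.json schema.

```r
library(jsonlite)

# Hypothetical miniature knowledge graph, not the challenge data
toy_json <- '{
  "nodes": [
    {"id": "A", "type": "company"},
    {"id": "B", "type": "person"}
  ],
  "links": [
    {"source": "B", "target": "A", "type": "owns"}
  ]
}'

toy_graph <- fromJSON(toy_json)

# fromJSON simplifies an array of objects into a data frame,
# so each element can be inspected or exported directly
str(toy_graph[["nodes"]])
str(toy_graph[["links"]])
```

Because each element is already a data frame, it can be passed straight to view() or write_csv() as in the code above.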
Drawing graph with network data
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, skimr, tidytext, tidyverse)
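As a rough sketch of where these packages lead, the example below builds a small directed graph with tidygraph and draws it with ggraph. The node and edge tables are invented, and the column names (id, from, to, type) are assumptions; the real nodes and links tables would likely need their columns renamed to fit.

```r
library(tidygraph)
library(ggraph)

# Hypothetical nodes and edges, not the challenge data
toy_nodes <- data.frame(
  id   = c("A", "B", "C"),
  type = c("company", "person", "person")
)
toy_edges <- data.frame(
  from = c("B", "C"),
  to   = c("A", "A"),
  type = c("owns", "works_for")
)

# node_key tells tbl_graph which node column the from/to values refer to
toy_tbl <- tbl_graph(nodes = toy_nodes,
                     edges = toy_edges,
                     directed = TRUE,
                     node_key = "id")

p <- ggraph(toy_tbl, layout = "fr") +
  geom_edge_link(aes(colour = type)) +   # one line per link, coloured by link type
  geom_node_point(aes(shape = type),     # one point per node, shaped by node type
                  size = 4) +
  geom_node_text(aes(label = id), vjust = -1)

p
```

Keeping the graph as a tbl_graph means the node and edge tables can still be filtered or mutated with the usual tidyverse verbs before plotting.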