In-class Exercise 5

Author

Seng Jing Yi

Published

May 11, 2024

Modified

May 30, 2024

Exploring Mini Challenge 1

VAST Challenge 2024

Packages:

quanteda - Quantitative Analysis of Textual Data
readtext - Reads text files in various formats
tidytext - R package for text mining

pacman::p_load(tidyverse, readtext, quanteda, tidytext)

Reading the data: the "/*" wildcard matches everything inside the data/articles folder, so all articles are read in one call.

text_data <- readtext(paste0("data/articles", "/*"))
view(text_data)

Tokenising the articles to identify key words: split the text into one word per row, keep tokens that end in a letter or apostrophe, and drop stop words

usenet_words <- text_data %>%
  unnest_tokens(word,text) %>%
  filter(str_detect(word, "[a-z']$"), 
         !word %in% stop_words$word)

view(usenet_words)
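
To see what each step of the pipeline does, here is a minimal sketch on a toy corpus. The `toy_docs` table and its two sentences are hypothetical stand-ins for `text_data`; only the pipeline itself comes from the exercise.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical two-document corpus standing in for text_data
toy_docs <- tibble(
  doc_id = c("a.txt", "b.txt"),
  text   = c("The company reported 12 fishing transactions.",
             "Sustainable fishing is the goal of the industry.")
)

toy_words <- toy_docs %>%
  unnest_tokens(word, text) %>%          # one lowercased token per row
  filter(str_detect(word, "[a-z']$"),    # keep tokens ending in a letter or apostrophe
         !word %in% stop_words$word)     # drop common English stop words

toy_words
```

The regex filter removes purely numeric tokens such as "12", while the stop word filter removes function words such as "the". Note that nothing in this pipeline selects nouns specifically; a part-of-speech tagger would be needed for that.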

Counting the most frequent words after tokenising

Consider: stem each word to its root form before counting, so that variants of the same word are counted together

usenet_words %>%
  count(word, sort = TRUE)

readtext object consisting of 3260 documents and 0 docvars.
# A data frame: 3,260 × 3
  word             n text     
  <chr>        <int> <chr>    
1 fishing       2177 "\"\"..."
2 sustainable   1525 "\"\"..."
3 company       1036 "\"\"..."
4 practices      838 "\"\"..."
5 industry       715 "\"\"..."
6 transactions   696 "\"\"..."
# ℹ 3,254 more rows
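
The stemming idea suggested above could look roughly like the sketch below. It uses `char_wordstem()` from quanteda, which is already loaded in this exercise; the `toy_words` table is a hypothetical stand-in for `usenet_words`, and the choice of stemmer is an assumption rather than something the exercise prescribes.

```r
library(dplyr)
library(quanteda)

# Hypothetical token table with the same `word` column as usenet_words
toy_words <- tibble(word = c("fishing", "fished", "fish", "practices", "practice"))

stem_counts <- toy_words %>%
  mutate(stem = char_wordstem(word, language = "english")) %>%  # Snowball English stemmer
  count(stem, sort = TRUE)                                      # count by stem, not by surface form

stem_counts
```

Stems are not always dictionary words, so it may be worth keeping the original `word` column alongside `stem` for labelling charts.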

Breaking down the text data with tidyr - splitting the doc_id column on a fixed delimiter with separate_wider_delim (Link: https://tidyr.tidyverse.org/reference/separate_wider_delim.html)

text_data_splitted <- text_data %>% 
  separate_wider_delim("doc_id", 
                       delim = "__0__", 
                       names = c("X", "Y"), 
                       too_few = "align_end")
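
The effect of `too_few = "align_end"` is easiest to see on a small example. The file names in `toy_ids` below are hypothetical; the call itself mirrors the one above and assumes tidyr 1.3.0 or later, where `separate_wider_delim()` was introduced.

```r
library(dplyr)
library(tidyr)

# Hypothetical doc_id values: one contains the delimiter, one does not
toy_ids <- tibble(doc_id = c("Alpha__0__Article 1.txt", "Article 2.txt"))

toy_split <- toy_ids %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")  # short rows fill from the right, NA goes in X

toy_split
```

With `"align_end"`, a doc_id without the delimiter keeps its value in `Y` and gets `NA` in `X`, instead of raising an error.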

References for text handling:

Text mining with R - tidytext: https://www.tidytextmining.com/
Using stringr to split text: https://stringr.tidyverse.org/
Using tidyr to delimit text files: https://tidyr.tidyverse.org/

Handling Network Data

Loading the jsonlite package to read JSON files

pacman::p_load(jsonlite, tidyverse, tidyr)

mc1_data <- fromJSON("data/mc1.json")

Data model: Multiple knowledge graphs, each made up of nodes and links (fromJSON already returns these as data frames)

Clicking into the nodes and links elements of the loaded object shows the underlying data.

# Viewing the underlying nodes and links data
view(mc1_data[["nodes"]])
view(mc1_data[["links"]])
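
As a rough illustration of why the nodes and links arrive as data frames, the sketch below parses a tiny JSON string with the same top-level shape. The field names (`id`, `type`, `source`, `target`) are hypothetical and are not taken from the MC1 schema.

```r
library(jsonlite)

# Hypothetical miniature graph; field names are illustrative only
toy_json <- '{
  "nodes": [{"id": "n1", "type": "company"}, {"id": "n2", "type": "vessel"}],
  "links": [{"source": "n1", "target": "n2", "type": "owns"}]
}'

toy_graph <- fromJSON(toy_json)

names(toy_graph)        # top-level elements of the parsed list
class(toy_graph$nodes)  # a JSON array of objects is simplified to a data frame
str(toy_graph$links)    # one row per link
```

By default `fromJSON()` simplifies arrays of records into data frames, which is why `mc1_data[["nodes"]]` can be passed straight to `view()`.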

mc2_data <- fromJSON("data/mc2.json")

view(mc2_data[["nodes"]])
view(mc2_data[["links"]])

# Exporting for analysis

#write_csv(mc1_data[["nodes"]], "mc1_nodes.csv")
#write_csv(mc1_data[["links"]], "mc1_links.csv")
#write_csv(mc2_data[["nodes"]], "mc2_nodes.csv")
#write_csv(mc2_data[["links"]], "mc2_links.csv")

Drawing graph with network data

pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, skimr, tidytext, tidyverse)
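
With these packages loaded, a graph can be built from a nodes table and a links table. The following is a minimal sketch on hypothetical toy data; the column names `id`, `type`, `from` and `to` are assumptions, so the real MC1 tables would likely need their columns renamed or selected first.

```r
library(ggplot2)
library(tidygraph)
library(ggraph)

# Hypothetical nodes and edges; not taken from mc1.json
toy_nodes <- data.frame(id   = c("n1", "n2", "n3"),
                        type = c("company", "vessel", "vessel"))
toy_edges <- data.frame(from = c("n1", "n1"),
                        to   = c("n2", "n3"))

# node_key tells tbl_graph which node column the character from/to values refer to
toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = TRUE,
                       node_key = "id")

# Build the plot object; print `p` to draw it
p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point(aes(colour = type), size = 4) +
  theme_graph()
```

Keeping the graph as a `tbl_graph` means the usual dplyr verbs can still be applied to its nodes and edges before plotting.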