library(malaytextr)

There is a data frame of Malay root words that can be used as a dictionary:
head(malayrootwords)
#>      Col Word Root Word
#> 1 pengabadian     abadi
#> 2  pengabdian      abdi
#> 3 pengacaraan     acara
#> 4 pengadangan     adang
#> 5  pengadilan      adil
#> 6   pengairan       air

stem_malay() first looks for the root word in a dictionary (the malayrootwords data frame can be used for this), then removes any “extra suffix”, the “prefix” and lastly the “suffix”.
To stem the word “banyaknya”, for example, the call below returns a data frame with the word “banyaknya” and its stemmed form “banyak”:
stem_malay(word = "banyaknya", dictionary = malayrootwords)
#> 'Root Word' is now returned instead of 'root_word'
#>    Col Word Root Word
#> 1 banyaknya    banyak

To stem words in a data frame:
x <- data.frame(text = c("banyaknya","sangat","terkedu", "pengetahuan"))
stem_malay(word = x, 
          dictionary = malayrootwords, 
          col_feature1 = "text")
#> 'Root Word' is now returned instead of 'root_word'
#>      Col Word Root Word
#> 1   banyaknya    banyak
#> 2      sangat    sangat
#> 3     terkedu      kedu
#> 4 pengetahuan      tahu

remove_url() removes all URLs found in a string:
x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
remove_url(x)
#> [1] "test "               "another one  to try"

There is a data frame of Malay stop words:
head(malaystopwords)
#> # A tibble: 6 × 1
#>   stopwords
#>   <chr>    
#> 1 ada      
#> 2 sampai   
#> 3 sana     
#> 4 itu      
#> 5 sangat   
#> 6 saya

The sentiment_general lexicon contains words labelled as positive or negative. It is useful for tasks such as sentiment analysis, which determines the overall sentiment expressed in a piece of text. To use the lexicon, process the text and check each word against the lexicon to determine its sentiment. Note that this lexicon was built from a general corpus sourced from news articles:
head(sentiment_general)
#> # A tibble: 6 × 2
#>   Word    Sentiment
#>   <chr>   <chr>    
#> 1 aduan   Negative 
#> 2 agresif Negative 
#> 3 amaran  Negative 
#> 4 anarki  Negative 
#> 5 ancaman Negative 
#> 6 aneh    Negative
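
The word-by-word lookup described above could be sketched roughly as follows. This is a minimal illustration in base R, not a function provided by malaytextr: score_sentiment() is a hypothetical helper, and the small lexicon is a stand-in with the same Word and Sentiment columns as sentiment_general (the “baik” entry is an assumption made for the example).

```r
# Toy lexicon with the same columns as sentiment_general (Word, Sentiment).
# "aduan", "amaran" and "ancaman" appear in the head() output above;
# the Positive entry "baik" is a hypothetical addition for illustration.
lexicon <- data.frame(
  Word = c("aduan", "amaran", "ancaman", "baik"),
  Sentiment = c("Negative", "Negative", "Negative", "Positive"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case the text, split it on non-letters,
# look each token up in the lexicon and count the matches per label.
score_sentiment <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # match() returns NA for tokens that are not in the lexicon
  labels <- lexicon$Sentiment[match(tokens, lexicon$Word)]
  c(positive = sum(labels == "Positive", na.rm = TRUE),
    negative = sum(labels == "Negative", na.rm = TRUE))
}

score_sentiment("Tiada aduan, perkhidmatan baik", lexicon)
```

With the package loaded, passing sentiment_general in place of the toy lexicon should work the same way, assuming its columns are named as shown in the output above. Stemming the tokens with stem_malay() before the lookup may improve coverage, since the lexicon presumably lists root forms.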