| Title: | Korean User Interface for MeCab in R |
|---|---|
| Description: | This package provides useful functions for text mining in Korean. It relies on the 'RcppMeCab' package for its core POS analysis. |
| Authors: | Junhewk Kim |
| Maintainer: | Junhewk Kim <[email protected]> |
| License: | GPL |
| Version: | 0.1.7.0 |
| Built: | 2026-05-23 05:18:06 UTC |
| Source: | https://github.com/junhewk/rmecabko |
install_dic installs Mecab-Ko-Dic.
install_dic()
This function checks for and installs Mecab-Ko-Dic on Linux and Mac OSX. Mecab-Ko-Dic is required for using a custom user dictionary. Installing it needs system privileges, because the function builds it from source and runs 'make install' to install it system-wide.
None. The function halts when the current operating system is not Linux or Mac OSX, or when Mecab-Ko-Dic is already installed.
See examples on GitHub.
## Not run:
install_dic()
## End(Not run)
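The platform check described for install_dic can be sketched in base R. This is a minimal illustration, not the package's actual code: the helper name `is_unix_like_target` is hypothetical, and the sketch assumes that `Sys.info()[["sysname"]]` reports "Linux" or "Darwin" (macOS) on the supported systems.

```r
# Hypothetical helper (not part of RmecabKo): TRUE on Linux or macOS,
# the platforms that install_dic() targets.
is_unix_like_target <- function(sysname = Sys.info()[["sysname"]]) {
  # isTRUE() also covers the case where Sys.info() returns NULL
  isTRUE(sysname %in% c("Linux", "Darwin"))
}

if (is_unix_like_target()) {
  message("Linux or macOS detected: install_dic() applies here.")
} else {
  message("Other platform detected: install_dic() would halt.")
}
```

Passing the system name as an argument keeps the check easy to exercise on any machine, for example `is_unix_like_target("Windows")`.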
install_mecab installs Mecab-Ko-MSVC and Mecab-Ko-Dic-MSVC.
install_mecab(mecabLocation)
mecabLocation |
a directory to install Mecab-Ko-MSVC and Mecab-Ko-Dic-MSVC. |
This function checks for and installs Mecab-Ko-MSVC and Mecab-Ko-Dic-MSVC in a user-specified directory. Windows only.
None. The function halts when the current operating system is not Windows, or when mecabLocation/mecab.exe already exists.
See examples on GitHub.
## Not run:
install_mecab("D:/Rlibs/mecab")
## End(Not run)
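The "already installed" condition for install_mecab amounts to a file-existence test. The sketch below shows one way to express it in base R; the helper name `has_mecab_exe` is hypothetical, and the demonstration uses a temporary directory with an empty placeholder file rather than a real MeCab build.

```r
# Hypothetical helper (not part of RmecabKo): does mecabLocation
# already contain a file named mecab.exe?
has_mecab_exe <- function(mecabLocation) {
  file.exists(file.path(mecabLocation, "mecab.exe"))
}

# Demonstrate with a throwaway directory instead of a real install path.
demo_dir <- file.path(tempdir(), "mecab-demo")
unlink(demo_dir, recursive = TRUE)          # start from a clean state
dir.create(demo_dir, showWarnings = FALSE)

before <- has_mecab_exe(demo_dir)           # nothing installed yet

# Empty placeholder file, not a real binary.
invisible(file.create(file.path(demo_dir, "mecab.exe")))
after <- has_mecab_exe(demo_dir)            # the check now succeeds

unlink(demo_dir, recursive = TRUE)          # clean up
```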
nouns returns nouns extracted from Korean phrases.
nouns(sentence, sys_dic = "", user_dic = "", parallel = FALSE)
sentence |
A character vector. |
Noun extraction is used for many Korean text analysis algorithms. The function coerces input to UTF-8.
A list of nouns is returned. The element names of the list are the original phrases.
See examples on GitHub.
## Not run:
nouns(c("Some Korean Phrases"))
## End(Not run)
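The UTF-8 coercion mentioned above matters when text arrives in a legacy encoding. The following base R sketch shows the general idea with `enc2utf8()`; it is an illustration only and does not reproduce the package's internal code. A Latin-1 string is used because it can be written with a single byte escape; Korean text from a legacy source would follow the same pattern once its encoding is declared.

```r
# A string stored as Latin-1 bytes and declared as such.
x <- "caf\xe9"
Encoding(x) <- "latin1"

# enc2utf8() re-encodes the string as UTF-8 and marks it accordingly.
y <- enc2utf8(x)

Encoding(y)    # declared encoding of the converted string
validUTF8(y)   # checks that the bytes form valid UTF-8
```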
pos returns part-of-speech (POS) tagged morpheme of Korean phrases.
pos(sentence, join = TRUE, format = c("list", "data.frame"), sys_dic = "", user_dic = "", parallel = FALSE)
sentence |
Character vector. |
join |
Boolean that determines whether the POS tags are returned joined with the morphemes. The default value is TRUE. |
format |
The data type of the result. The default value is "list". Set this to "data.frame" to get the result as a data frame. |
sys_dic |
The location of the system MeCab dictionary. The default value is "". |
user_dic |
The location of the user-specific MeCab dictionary. The default value is "". |
parallel |
Boolean that determines whether to analyze in parallel. The default value is FALSE. |
This is the basic function for part-of-speech tagging with mecab-ko. The function coerces input to UTF-8.
A list of POS-tagged morphemes is returned, each in conjoined character vector form. The element names of the list are the original phrases. If join=FALSE, it returns a list of morphemes named with their tags.
See examples on GitHub.
## Not run:
pos(c("Some Korean Phrases"))
pos(c("Some Korean Phrases"), join=FALSE)
## End(Not run)
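The two result shapes can be related to each other with a small base R sketch. It assumes that the conjoined form is written as `morpheme/TAG`, which is an assumption about the output layout rather than something stated in this manual. The helper `split_tagged` is hypothetical, and the sample tokens are hand-written for illustration, not actual analyzer output.

```r
# Hand-written sample in an assumed "morpheme/TAG" layout.
joined <- c("안녕/NNG", "하/XSV", "세요/EP+EF")

# Hypothetical helper: turn the conjoined form into a character vector of
# morphemes whose names are the tags, similar to the join = FALSE shape.
split_tagged <- function(x) {
  morph <- sub("/[^/]*$", "", x)   # drop everything from the last "/"
  tag   <- sub("^.*/", "", x)      # keep only what follows the last "/"
  stats::setNames(morph, tag)
}

tagged <- split_tagged(joined)
names(tagged)    # the tags
unname(tagged)   # the morphemes
```

Splitting on the last slash keeps morphemes that themselves contain a slash intact.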
mecab-ko and mecab-ko-dic are based on a C++ library,
and POS tagging with them is useful when the spacing of the source text is incorrect.
To integrate mecab-ko with R, the Rcpp package provides the basic framework.
It is based on the Eunjeon Project.
On Mac OSX and Linux, you need to install mecab-ko and mecab-ko-dic before installing this package in R.
mecab-ko: https://bitbucket.org/eunjeon/mecab-ko
mecab-ko-dic: https://bitbucket.org/eunjeon/mecab-ko-dic
On Windows, the install_mecab(mecabLocation) function installs mecab-ko-msvc and mecab-ko-dic-msvc in a user-specified directory.
Because the Windows version operates through system commands and file I/O, analysis is slower than on Linux-based operating systems.
Junhewk Kim
Wonsup Yoon, mecab-ko VC++ builds at https://github.com/Pusnow/mecab-ko-msvc, https://github.com/Pusnow/mecab-ko-dic-msvc
## Not run:
# install.packages("devtools")
devtools::install_github("junhewk/RmecabKo")
# On Windows platform only
install_mecab("D:/Rlibs/mecab")
phrase <- c("Some Korean Phrases")  # replace with Korean character vectors
# For full POS tagging
pos(phrase)
# For noun extraction only
nouns(phrase)
# For tokenizing of selective morphemes
token_words(phrase)
# For n-grams tokenizing
token_ngrams(phrase)
## End(Not run)
These tokenizer functions perform tokenization into full morphemes, selected morphemes, or nouns.
token_morph(phrase, strip_punct = FALSE, strip_numeric = FALSE)
token_words(phrase, strip_punct = FALSE, strip_numeric = FALSE)
token_nouns(phrase, strip_punct = FALSE, strip_numeric = FALSE)
phrase |
A character vector or a list of character vectors to be tokenized into morphemes.
If |
strip_punct |
Boolean. Set this to TRUE to remove punctuation from the phrase. |
strip_numeric |
Boolean. Set this to TRUE to remove numbers from the phrase. |
A list of character vectors containing the tokens, with one element in the list.
See examples on GitHub.
## Not run:
txt <- "Some Korean sentence"  # replace with a Korean sentence
token_morph(txt)
token_words(txt, strip_punct = FALSE)
token_nouns(txt, strip_numeric = TRUE)
## End(Not run)
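The effect of strip_punct and strip_numeric can be illustrated on a vector that is already tokenized. The sketch below is a rough base R approximation that filters tokens with regular expressions; the package itself may filter differently (for example by POS tag), so treat `strip_tokens` as a hypothetical helper and the sample tokens as hand-written input.

```r
# Hypothetical helper (not part of RmecabKo): drop tokens that consist
# only of punctuation and/or only of ASCII digits.
strip_tokens <- function(tokens, strip_punct = FALSE, strip_numeric = FALSE) {
  if (strip_punct) {
    tokens <- tokens[!grepl("^[[:punct:]]+$", tokens)]
  }
  if (strip_numeric) {
    tokens <- tokens[!grepl("^[0-9]+$", tokens)]
  }
  tokens
}

# Hand-written tokens for illustration.
tokens <- c("서울", "2024", "년", ".", "!")

strip_tokens(tokens)                                            # nothing removed
strip_tokens(tokens, strip_punct = TRUE)                        # "." and "!" removed
strip_tokens(tokens, strip_punct = TRUE, strip_numeric = TRUE)  # "2024" removed as well
```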
This function tokenizes inputs into n-grams. At this stage of development, it offers
basic (shingle) n-grams only; other n-gram functionality will be added later. Punctuation
and numerals are stripped by this tokenizer, because they are usually not useful in Korean n-grams.
The n-gram function is based on the selective morpheme tokenizer (token_words), but you can
select another tokenizer as well.
token_ngrams(phrase, n = 3L, div = c("morph", "words", "nouns"), stopwords = character(), ngram_delim = " ")
phrase |
A character vector or a list of character vectors to be tokenized into morphemes.
If |
n |
The number of words in the n-gram. This must be an integer greater than or equal to 1. |
div |
The token generator definition. The options are "morph", "words", and "nouns". |
stopwords |
A set of stopwords to exclude from the tokens. |
ngram_delim |
The separator between words in an n-gram. |
A list of character vectors containing the tokens, with one element in the list.
See examples on GitHub.
## Not run:
txt <- "Some Korean sentence"  # replace with a Korean sentence
token_ngrams(txt)
token_ngrams(txt, n = 2)
## End(Not run)
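Shingle n-grams are every run of n consecutive tokens, joined by the delimiter. The base R sketch below shows that idea on a token vector that has already been produced by some tokenizer. `shingle_ngrams` is a hypothetical helper, and removing stopwords before forming the n-grams is an assumption about ordering, not a documented property of token_ngrams.

```r
# Hypothetical helper (not part of RmecabKo): build shingle n-grams
# from a character vector of tokens.
shingle_ngrams <- function(tokens, n = 3L, stopwords = character(),
                           ngram_delim = " ") {
  stopifnot(length(n) == 1, n >= 1)
  # Assumption: stopwords are dropped before the n-grams are formed.
  tokens <- tokens[!tokens %in% stopwords]
  if (length(tokens) < n) {
    return(character())   # too few tokens for even one n-gram
  }
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1L)], collapse = ngram_delim),
    character(1)
  )
}

# Hand-written tokens for illustration.
tokens <- c("한국어", "텍스트", "분석", "패키지")

shingle_ngrams(tokens)                       # trigrams
shingle_ngrams(tokens, n = 2L)               # bigrams
shingle_ngrams(tokens, n = 2L, ngram_delim = "_")
```

A vector of k tokens yields k - n + 1 n-grams, so four tokens give two trigrams or three bigrams.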
words returns full morphemes extracted from Korean phrases.
words(phrase)
phrase |
Character vector. |
It is based on Mecab-Ko POS classification. Full morphemes are consisted with The function coerces input to UTF-8.
A list of full morphemes is returned.
See examples on GitHub.
## Not run:
words(c("Some Korean Phrases"))
## End(Not run)