| Title: | 'rcpp' Wrapper for 'mecab' Library |
|---|---|
| Description: | R package based on 'Rcpp' for 'MeCab': Yet Another Part-of-Speech and Morphological Analyzer. The package aims to provide a seamless environment for developing with and analyzing CJK texts. It uses parallel programming to provide the highly efficient text preprocessing function 'posParallel()'. For installation, please refer to the README.md file. |
| Authors: | Junhewk Kim [aut, cre], Taku Kudo [aut], Akiru Kato [ctb], Patrick Schratz [ctb] |
| Maintainer: | Junhewk Kim <[email protected]> |
| License: | GPL |
| Version: | 0.0.1.5 |
| Built: | 2026-05-23 06:47:25 UTC |
| Source: | https://github.com/junhewk/rcppmecab |
dict_index compiles a user dictionary CSV file into a binary
dictionary that can be used with pos and posParallel.
dict_index( dic_csv, out_dic, dic_dir, dic_charset = "utf-8", out_charset = "utf-8" )
| Argument | Description |
|---|---|
| dic_csv | Character scalar. Path to the user dictionary CSV file. Multiple CSV files can be provided as a character vector. |
| out_dic | Character scalar. Path for the compiled output dictionary file. |
| dic_dir | Character scalar. Path to the system dictionary directory. This is required so that MeCab can reference the system dictionary configuration during compilation. |
| dic_charset | Character scalar. Charset of the input CSV file. Default is "utf-8". |
| out_charset | Character scalar. Charset of the output dictionary. Default is "utf-8". |
This function wraps MeCab's mecab-dict-index internally, so you
do not need the command-line tool installed separately.
Returns TRUE invisibly on success.
    ## Not run:
    dict_index(
      dic_csv = "user_words.csv",
      out_dic = "user.dic",
      dic_dir = "/usr/local/lib/mecab/dic/ipadic"
    )
    # Then use the compiled dictionary:
    pos("some text", user_dic = "user.dic")
    ## End(Not run)
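Before compiling, it can help to confirm that every row of the user dictionary CSV has the same number of fields. The following is a minimal sketch in base R, not part of the package: the helper name has_consistent_fields and the two sample rows are hypothetical, and the row layout is a placeholder rather than the layout of any real MeCab dictionary.

```r
# Write two made-up rows to a temporary CSV. The field layout is a placeholder,
# not the layout of any real MeCab dictionary.
csv_path <- file.path(tempdir(), "user_words.csv")
writeLines(c("wordA,0,0,100,NOUN", "wordB,0,0,100,NOUN"), csv_path)

# TRUE when every non-blank row has the same number of comma-separated fields.
# Double quotes are honored, so a quoted field may contain a comma.
has_consistent_fields <- function(path) {
  n <- utils::count.fields(path, sep = ",", quote = "\"", comment.char = "")
  length(n) > 0 && !anyNA(n) && length(unique(n)) == 1L
}

csv_ok <- has_consistent_fields(csv_path)
```

If csv_ok is TRUE, the file can then be passed to dict_index as dic_csv. The check only covers the field count; it does not validate the field contents against the system dictionary.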
Downloads and installs a MeCab system dictionary for the specified language.
Japanese and Chinese dictionaries are compiled from source using the built-in
mecab-dict-index; Korean dictionaries are downloaded pre-compiled.
No system-level MeCab installation is required.
download_dic(lang)
| Argument | Description |
|---|---|
| lang | Character scalar. Language code, for example "ja", "ko", or "zh". |
Dictionaries are stored in the user data directory
(tools::R_user_dir("RcppMeCab", "data")).
Returns the path to the installed dictionary directory, invisibly.
    ## Not run:
    download_dic("ja")
    download_dic("ko")
    download_dic("zh")
    pos("some text", lang = "ja")
    ## End(Not run)
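To see where downloaded dictionaries end up, the storage location named above can be resolved directly in base R. This is a small sketch; it assumes each dictionary sits in its own subdirectory of the user data directory, which the documentation does not state explicitly.

```r
# Resolve the per-user data directory that the documentation names.
dic_root <- tools::R_user_dir("RcppMeCab", which = "data")

# List the immediate subdirectories, if the directory exists yet.
# Assumption: one subdirectory per installed dictionary.
installed <- if (dir.exists(dic_root)) {
  list.dirs(dic_root, full.names = FALSE, recursive = FALSE)
} else {
  character(0)
}
```

On a machine where download_dic has never been run, installed is simply an empty character vector.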
Shows all available MeCab dictionaries, including the bundled dictionary
and any downloaded via download_dic.
list_dic()
Returns a data frame with the columns lang, name, path,
and active.
    ## Not run:
    list_dic()
    ## End(Not run)
pos returns the part-of-speech (POS) tagged morphemes of each sentence.
pos( sentence, join = TRUE, format = c("list", "data.frame"), lang = NULL, sys_dic = "", user_dic = "" )
| Argument | Description |
|---|---|
| sentence | A character vector of any length. To analyze multiple sentences, put them in one character vector. |
| join | A logical value that controls the output format. The default value is TRUE. If FALSE, the function returns morphemes only and stores the tags as an attribute. |
| format | The data type of the result. The default value is "list". Set it to "data.frame" to get the result as a data frame. |
| lang | Optional language code (for example "ja" or "ko"). The default value is NULL. |
| sys_dic | The location of the system MeCab dictionary. The default value is "". |
| user_dic | The location of a user-specific MeCab dictionary. The default value is "". |
This is the basic function for the MeCab part-of-speech tagger. It takes a character vector of any length and runs the loop in C++ for faster processing.
You can supply a user dictionary through user_dic. It must be compiled with
mecab-dict-index (for example through dict_index). Instructions for compiling a user
dictionary are available at https://github.com/junhewk/RcppMeCab.
You can also set a system dictionary through sys_dic, which is useful if you work with multiple
dictionaries (for example, both the IPA and Juman dictionaries for Japanese).
With options(mecabSysDic=), you can set your
preferred system dictionary for the R session.
If you want morphemes only, use join = FALSE to store the tag names as an attribute.
By default, the function returns a list of character vectors whose elements have the form (morpheme)/(tag).
Returns a character vector or a list of POS-tagged morphemes in joined character vector form.
    ## Not run:
    sentence <- c("some UTF-8 text")  # replace with your own UTF-8 texts
    pos(sentence)
    pos(sentence, join = FALSE)
    pos(sentence, format = "data.frame")
    pos(sentence, lang = "ja")
    pos(sentence, lang = "ko")
    pos(sentence, sys_dic = "/path/to/custom/dic")
    pos(sentence, user_dic = "/path/to/user.dic")
    ## End(Not run)
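The joined (morpheme)/(tag) form described above can be taken apart again in base R. The sketch below is illustrative only: split_joined is a hypothetical helper, and the input strings are made up to match the documented shape rather than copied from real MeCab output.

```r
# Split "morpheme/tag" strings at the LAST slash, so a morpheme that itself
# contains "/" is kept whole. Assumes the tag contains no slash.
split_joined <- function(x) {
  data.frame(
    morpheme = sub("/[^/]*$", "", x),
    tag      = sub("^.*/", "", x),
    stringsAsFactors = FALSE
  )
}

# Made-up input in the joined shape; not real MeCab output.
joined <- c("cat/NOUN", "runs/VERB", "a/b/SYMBOL")
parsed <- split_joined(joined)
```

For most uses, calling pos with join = FALSE or format = "data.frame" avoids this step; the helper is mainly useful for results that were already saved in joined form.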
posParallel returns the part-of-speech (POS) tagged morphemes of each sentence.
posParallel( sentence, join = TRUE, format = c("list", "data.frame"), lang = NULL, sys_dic = "", user_dic = "" )
| Argument | Description |
|---|---|
| sentence | A character vector of any length. To analyze multiple sentences, put them in one character vector. |
| join | A logical value that controls the output format. The default value is TRUE. If FALSE, the function returns morphemes only and stores the tags as an attribute. |
| format | The data type of the result. The default value is "list". Set it to "data.frame" to get the result as a data frame. |
| lang | Optional language code (for example "ja" or "ko"). The default value is NULL. |
| sys_dic | The location of the system MeCab dictionary. The default value is "". |
| user_dic | The location of a user-specific MeCab dictionary. The default value is "". |
This is a parallelized version of the MeCab part-of-speech tagger. It takes a character vector of any length and runs the loop in C++ with Intel TBB for faster processing.
RcppParallel does not support parallelizing over a character vector,
so this function makes copies of both the input and the output.
If your data volume is large, therefore, use pos or split the vector into
several sub-vectors.
You can supply a user dictionary through user_dic. It must be compiled with
mecab-dict-index (for example through dict_index). Instructions for compiling a user
dictionary are available at https://github.com/junhewk/RcppMeCab.
You can also set a system dictionary through sys_dic, which is useful if you work with multiple
dictionaries (for example, both the IPA and Juman dictionaries for Japanese).
With options(mecabSysDic=), you can set your
preferred system dictionary for the R session.
If you want morphemes only, use join = FALSE to store the tag names as an attribute.
By default, the function returns a list of character vectors whose elements have the form (morpheme)/(tag).
Returns a character vector or a list of POS-tagged morphemes in joined character vector form.
    ## Not run:
    sentence <- c("some UTF-8 text")  # replace with your own UTF-8 texts
    posParallel(sentence)
    posParallel(sentence, join = FALSE)
    posParallel(sentence, format = "data.frame")
    posParallel(sentence, lang = "ja")
    posParallel(sentence, lang = "ko")
    posParallel(sentence, sys_dic = "/path/to/custom/dic")
    posParallel(sentence, user_dic = "/path/to/user.dic")
    ## End(Not run)
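The advice to divide a large vector into sub-vectors can be sketched in base R as follows. The helper tag_in_chunks and its arguments are hypothetical, not part of the package; the tagger argument stands for any function that returns one list element per sentence, and a simple stand-in is used here so the sketch runs without a dictionary.

```r
# Process a character vector in fixed-size chunks and concatenate the results.
# `tagger` is any function that maps a character vector to a list with one
# element per sentence.
tag_in_chunks <- function(sentences, chunk_size, tagger) {
  chunk_id <- ceiling(seq_along(sentences) / chunk_size)
  chunks <- split(sentences, chunk_id)
  results <- lapply(chunks, tagger)
  # Drop the chunk labels, then join the per-chunk lists end to end.
  do.call(c, unname(results))
}

# Stand-in tagger used only so this sketch is self-contained.
stand_in <- function(x) as.list(toupper(x))
out <- tag_in_chunks(letters[1:5], chunk_size = 2, tagger = stand_in)
```

With the package installed and a dictionary available, posParallel could presumably be passed as tagger, for example tag_in_chunks(sentence, 1000, posParallel); the chunk size of 1000 is an arbitrary illustration, not a recommendation from the package.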
R package based on Rcpp for MeCab: Yet Another Part-of-Speech and
Morphological Analyzer (http://taku910.github.io/mecab/). The package aims to
provide a seamless environment for developing with and analyzing
CJK texts. It uses parallel programming to provide the
highly efficient text preprocessing function posParallel().
For installation, please refer to the README.md file.
This package uses the MeCab C API and Rcpp code.
Junhewk Kim, Taku Kudo
Useful links:
Report bugs at https://github.com/junhewk/RcppMeCab/issues
Sets the default system dictionary used by pos and
posParallel. This is equivalent to calling
options(mecabSysDic = path) but allows selection by language code.
set_dic(lang)
| Argument | Description |
|---|---|
| lang | Character scalar. Language code (for example "ja" or "ko"). |
Returns the path to the activated dictionary directory, invisibly.
    ## Not run:
    set_dic("ja")
    pos("some Japanese text")
    set_dic("ko")
    pos("some Korean text")
    ## End(Not run)
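Because set_dic is described as equivalent to setting options(mecabSysDic = path), the option can also be set and restored by hand with base R. This is a minimal sketch; the path is a placeholder and does not point to a real dictionary.

```r
# Set the option and keep the previous value so it can be restored.
# "/path/to/custom/dic" is a placeholder, not a real dictionary location.
old_opts <- options(mecabSysDic = "/path/to/custom/dic")

# Value the session sees while the option is set.
active_dic <- getOption("mecabSysDic")

# Restore whatever was configured before.
options(old_opts)
```

Saving and restoring the previous value keeps a temporary dictionary switch from leaking into the rest of the session.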