Help for package tmcn

License:

LGPL-2 | LGPL-2.1 | LGPL-3 [expanded from: LGPL]

Title:

A Text Mining Toolkit for Chinese

Type:

Package

LazyLoad:

yes

Author:

Jian Li

Maintainer:

Jian Li <rweibo@sina.com>

Description:

A Text mining toolkit for Chinese, which includes facilities for Chinese string processing, Chinese NLP supporting, encoding detecting and converting. Moreover, it provides some functions to support 'tm' package in Chinese.

Version:

0.2-13

Date:

2019-08-04

Depends:

R (≥ 3.0.0), utils

Suggests:

RoxygenNote:

6.1.1

NeedsCompilation:

yes

Packaged:

2019-08-08 03:36:59 UTC; jli

Repository:

CRAN

Date/Publication:

2019-08-08 04:40:02 UTC

GBK character set

Description

GBK character set including some useful information.

Usage

data(GBK)

Format

A data frame with 8 columns.

GBK: Chinese characters in UTF-8.
py0: Unique Pinyin of each character.
py: Pinyin string of each character.
Radical: In Chinese, it means 'Bu Shou'.
Stroke_Num_Radical: In Chinese, it means the number of 'Bi Hua'.
Stroke_Order: In Chinese, it means 'Bi Shun'.
Structure: In Chinese, it means 'Zi Ti Jie Gou'.
Freq: Frequency of the character in Sogou news corpus from all sites between June and July 2012.

Author(s)

Jian Li <rweibo@sina.com>

National Taiwan University Semantic Dictionary

Description

National Taiwan University Semantic Dictionary.

Usage

data(NTUSD)

Format

A list with 4 components.

positive_chs: Positive words in simplified Chinese
negative_chs: Negative words in simplified Chinese
positive_cht: Positive words in traditional Chinese
negative_cht: Negative words in traditional Chinese

References

http://nlg.csie.ntu.edu.tw

Dictionary of simplified and traditional Chinese

Description

Dictionary of simplified and traditional Chinese.

Usage

data(SIMTRA)

Format

A data frame with 2 columns.

Sim: a simplified Chinese string.
Tra: a traditional Chinese string.

Sport news.

Description

Sport news.

Usage

data(SPORT)

Format

A data frame with 6 columns.

id: ID of the news.
time: Time of the news.
title: Title of the news.
class: Class of the news, 'B' means Basketball, 'F' means Football.
abstract: Abstract of the news.
content: Content of the news.

Dictionary of Chinese stop words

Description

Dictionary of Chinese stop words.

Usage

data(STOPWORDS)

Format

A data frame with 1 column.

word: a string vertor of the stop words.

Print the UTF-8 codes of a string.

Description

Print the UTF-8 codes of a string.

Usage

catUTF8(string, file = "")

Arguments

string

A character vector.

file

A connection, or a character string naming the file to print to. If "" (the default), cat prints to the standard output connection, the console unless redirected by sink.

Value

No results.

Author(s)

Jian Li <rweibo@sina.com>

Examples

catUTF8("hello")

Create a Chinese term-document matrix or a document-term matrix.

Description

Create a Chinese term-document matrix or a document-term matrix.

Usage

createDTM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)
createTDM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)

Arguments

string

A character vector.

language

The language type, 'zh' means Chinese.

tokenize

A tokenizers function.

removePunctuation

Whether to remove the punctuations.

removeNumbers

Whether to remove the numbers.

removeStopwords

Whether to remove the stop words.

Details

Package "tm" is required.

Value

An object of class TermDocumentMatrix or class DocumentTermMatrix.

Author(s)

Jian Li <rweibo@sina.com>

Create a word frequency data.frame.

Description

Create a word frequency data.frame.

Usage

createWordFreq(obj, onlyCN = TRUE, nosymbol = TRUE, stopwords = NULL,
  useStopDic = FALSE)

Arguments

obj

A character vector or DocumentTermMatrix to calculate words frequency.

onlyCN

Whether to keep only Chinese words.

nosymbol

Whether to keep symbols.

stopwords

A character vector of stop words.

useStopDic

Whether to use the default stop words.

Value

A data.frame.

Author(s)

Jian Li <rweibo@sina.com>

Examples

createWordFreq(c("a", "a", "b", "c"), onlyCN = FALSE, nosymbol = TRUE, useStopDic = FALSE)

Get the current encoding of the locale.

Description

Get the current encoding of the locale.

Usage

getCharset()

Value

Character of encoding.

Author(s)

Jian Li <rweibo@sina.com>

Examples

getCharset()

Indicate whether the encoding of input string is BIG5.

Description

Indicate whether the encoding of input string is BIG5.

Usage

isBIG5(string, combine = FALSE)

Arguments

string

A character vector.

combine

Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <rweibo@sina.com>

Examples

isBIG5("hello")

Indicate whether the encoding of input string is GB18030.

Description

Indicate whether the encoding of input string is GB18030.

Usage

isGB18030(string, combine = FALSE)

Arguments

string

A character vector.

combine

Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <rweibo@sina.com>

Examples

isGB18030("hello")

Indicate whether the encoding of input string is GB2312.

Description

Indicate whether the encoding of input string is GB2312.

Usage

isGB2312(string, combine = FALSE)

Arguments

string

A character vector.

combine

Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <rweibo@sina.com>

Examples

isGB2312("hello")

Indicate whether the encoding of input string is GBK.

Description

Indicate whether the encoding of input string is GBK.

Usage

isGBK(string, combine = FALSE)

Arguments

string

A character vector.

combine

Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <rweibo@sina.com>

Examples

isGBK("hello")

Indicate whether the encoding of input string is UTF-8.

Description

Indicate whether the encoding of input string is UTF-8.

Usage

isUTF8(string, combine = FALSE)

Arguments

string

A character vector.

combine

Whether to combine all the strings.

Value

Logical value.

Author(s)

Jian Li <rweibo@sina.com>

Examples

isUTF8("hello")

Extract the left or right substrings in a character vector.

Description

Extract the left or right substrings in a character vector.

Usage

left(string, n)
right(string, n)

Arguments

string

A character vector.

n

How many characters.

Value

A character vector.

Author(s)

Jian Li <rweibo@sina.com>

Examples

left("hello", 3)

Revert UTF-8 string to Chinese character.

Description

Revert UTF-8 string to Chinese character.

Usage

revUTF8(string, utype = "R")

Arguments

string

A character vector.

utype

UTF-8 string type, the default is R type, such as "<U+XXXX>".

Value

A character vector.

Author(s)

Jian Li <rweibo@sina.com>

Set locale to Simplified Chinese/Traditional Chinese/UK.

Description

Set locale to Simplified Chinese/Traditional Chinese/UK.

Usage

setchs(rev = FALSE)
setcht(rev = FALSE)
setuk(rev = FALSE)

Arguments

rev

Whethet to set the locale back.

Value

No results.

Author(s)

Jian Li <rweibo@sina.com>

Examples

setchs()
setchs(rev = TRUE)

Return Chinese stop words.

Description

Return Chinese stop words.

Usage

stopwordsCN(stopwords = NULL, useStopDic = TRUE)

Arguments

stopwords

A character vector of stop words.

useStopDic

Whether to use the default stop words.

Value

A vector of stop words.

Author(s)

Jian Li <rweibo@sina.com>

Examples

stopwordsCN("yes", useStopDic = FALSE)

Mixed case capitalizing.

Description

To capitalize every first letter of a word.

Usage

strcap(string, strict = FALSE)

Arguments

string

A character vector.

strict

Whether strict.

Value

A character vector with the first letter of each word capitalized.

Author(s)

Jian Li <rweibo@sina.com>

Examples

strcap("the quick red fox jumps over the lazy brown dog")

Extract matched substrings by regular expression.

Description

Extract matched substrings by regular expression.

Usage

strextract(string, pattern, invert = FALSE, ignore.case = FALSE,
  perl = FALSE, useBytes = FALSE)

Arguments

string

A character vector.

pattern

A character string containing a regular expression to be matched in the given character vector.

invert

A logical value: if TRUE, extract the non-matched substrings.

ignore.case

If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.

perl

A logical value. Should perl-compatible regexps be used?

useBytes

A logical value. If TRUE the matching is done byte-by-byte rather than character-by-character.

Value

A character vector with the matched or non-matched substrings.

Author(s)

Jian Li <rweibo@sina.com>

Examples

txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
strextract(txt1, "\\([^)]*\\)")
txt2 <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
strextract(txt2, "(?<first>[[:upper:]][[:lower:]]+)", perl = TRUE)

Pad a string to a specified length with a padding character.

Description

Pad a string to a specified length with a padding character.

Usage

strpad(string, width = 0, side = c("left", "right", "both"),
  pad = " ")

Arguments

string

A character vector.

width

The number of characters of the string after padding.

side

Which side to pad.

pad

The padding character.

Value

A character vector after padding.

Author(s)

Jian Li <rweibo@sina.com>

Examples

strpad(1:5, width = 4, pad = "0")

Trim space of a string.

Description

Trim space of a string.

Usage

strstrip(string, side = c("both", "left", "right"))

Arguments

string

A character vector.

side

Which side of the string to be trimed, 'both', 'left' or 'right'.

Value

Trimed vector.

Author(s)

Jian Li <rweibo@sina.com>

Examples

strstrip(c("\taaaa ", " bbbb    "))

Convert a chinese text to pinyin format.

Description

Convert a chinese text to pinyin format.

Usage

toPinyin(string, capitalize = FALSE)

Arguments

string

A character vector.

capitalize

Whether to capitalize the first letter of each word.

Value

A character vector in pinyin format.

Author(s)

Jian Li <rweibo@sina.com>

Examples

toPinyin("the quick red fox jumps over the lazy brown dog")

Convert a Chinese text from simplified to traditional characters and vice versa.

Description

Convert a chinese text from simplified to traditional characters and vice versa.

Usage

toTrad(string, rev = FALSE)

Arguments

string

A Chinese string vector.

rev

Reverse. TRUE means traditional to simplified. Default is FALSE.

Value

Converted vectors.

Author(s)

Jian Li <rweibo@sina.com>

Examples

toTrad("hello")

Convert encoding of Chinese string to UTF-8.

Description

Convert encoding of Chinese string to UTF-8.

Usage

toUTF8(cnstring)

Arguments

cnstring

A Chinese string vector.

Value

Converted vectors.

Author(s)

Jian Li <rweibo@sina.com>

Examples

toUTF8("hello")