Type: | Package |
Title: | Regression Analysis and Forecasting Using Textual Data from a Time-Varying Dictionary |
Version: | 0.1.3 |
Description: | Provides functionalities based on the paper "Time Varying Dictionary and the Predictive Power of FED Minutes" (Lima, 2018) <doi:10.2139/ssrn.3312483>. It selects the most predictive terms, which we call the time-varying dictionary, using supervised machine learning techniques such as lasso and elastic net. |
Depends: | R (≥ 3.1.0) |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.2 |
Imports: | forecast, stats, tidyr, tidytext, tm, wordcloud, dplyr, plyr, udpipe, RColorBrewer, ggplot2, glmnet, pdftools, parallel, doParallel, pracma, forcats, Matrix |
URL: | https://github.com/lucasgodeiro/TextForecast |
BugReports: | https://github.com/lucasgodeiro/TextForecast/issues |
Suggests: | knitr, rmarkdown, covr |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2022-04-22 11:49:52 UTC; Lucas |
Author: | Luiz Renato Lima [aut], Lucas Godeiro [aut, cre] |
Maintainer: | Lucas Godeiro <lucas.godeiro@hotmail.com> |
Repository: | CRAN |
Date/Publication: | 2022-04-25 08:50:02 UTC |
get_collocations function
Description
get_collocations function
Usage
get_collocations(
corpus_dates,
path_name,
ntrms,
ngrams_number,
min_freq,
language
)
Arguments
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
path_name |
the folders path where the subfolders with the dates are located. |
ntrms |
maximum number of collocations that will be filtered by tf-idf. We rank the collocations by tf-idf in decreasing order and then select the collocations with the ntrms highest tf-idf. |
ngrams_number |
an integer indicating the size of the collocations. Defaults to 2 (bigrams). If set to 3, collocations of bigrams and trigrams are computed. |
min_freq |
an integer indicating the minimum number of times a collocation must occur in the data in order to be returned. |
language |
the language of the texts. Default is English. |
Value
a list containing a sparse matrix with the counts of all collocations and another with the collocation counts filtered by tf-idf according to ntrms.
Examples
st_year=2017
end_year=2018
path_name=system.file("news",package="TextForecast")
#qt=paste0(sort(rep(seq(from=st_year,to=end_year,by=1),12)),
#c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
#z_coll=get_collocations(corpus_dates=qt[1:23],path_name=path_name,
#ntrms=500,ngrams_number=3,min_freq=10)
#
path_name=system.file("news",package="TextForecast")
days=c("2019-30-01","2019-31-01")
z_coll=get_collocations(corpus_dates=days[1],path_name=path_name,
ntrms=500,ngrams_number=3,min_freq=1)
get_terms function
Description
get_terms function
Usage
get_terms(
corpus_dates,
ntrms_words,
st,
path.name,
ntrms_collocation,
ngrams_number,
min_freq,
language
)
Arguments
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
ntrms_words |
maximum number of words that will be filtered by tf-idf. We rank the words by tf-idf in decreasing order and then select the words with the ntrms_words highest tf-idf. |
st |
set 0 to stem the words and 1 otherwise. |
path.name |
the folders path where the subfolders with the dates are located. |
ntrms_collocation |
maximum number of collocations that will be filtered by tf-idf. We rank the collocations by tf-idf in decreasing order and then select the collocations with the ntrms_collocation highest tf-idf. |
ngrams_number |
an integer indicating the size of the collocations. Defaults to 2 (bigrams). If set to 3, collocations of bigrams and trigrams are computed. |
min_freq |
an integer indicating the minimum number of times a collocation must occur in the data in order to be returned. |
language |
the language of the texts. Default is English. |
Value
a list containing a sparse matrix with the counts of all collocations and words and another with the collocation and word counts filtered by tf-idf according to the ntrms arguments.
Examples
st_year=2017
end_year=2018
path_name=system.file("news",package="TextForecast")
#qt=paste0(sort(rep(seq(from=st_year,to=end_year,by=1),12)),
#c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
#z_terms=get_terms(corpus_dates=qt[1:23],path.name=path_name,
#ntrms_words=500,ngrams_number=3,st=0,ntrms_collocation=500,min_freq=10)
#
path_name=system.file("news",package="TextForecast")
days=c("2019-30-01","2019-31-01")
z_terms=get_terms(corpus_dates=days[1],path.name=path_name,
ntrms_words=500,ngrams_number=3,st=0,ntrms_collocation=500,min_freq=1)
get_words function
Description
get_words function
Usage
get_words(corpus_dates, ntrms, st, path_name, language)
Arguments
corpus_dates |
a character vector indicating the subfolders where the texts are located. |
ntrms |
maximum number of words that will be filtered by tf-idf. We rank the words by tf-idf in decreasing order and then select the words with the ntrms highest tf-idf. |
st |
set 0 to stem the words and 1 otherwise. |
path_name |
the folders path where the subfolders with the dates are located. |
language |
the language of the texts. |
Value
a list containing a sparse matrix with the counts of all words and another with the word counts filtered by tf-idf according to ntrms.
Examples
st_year=2017
end_year=2018
path_name=system.file("news",package="TextForecast")
#qt=paste0(sort(rep(seq(from=st_year,to=end_year,by=1),12)),
#c("m1","m2","m3","m4","m5","m6","m7","m8","m9","m10","m11","m12"))
#z_wrd=get_words(corpus_dates=qt[1:23],path_name=path_name,ntrms=500,st=0)
#
path_name=system.file("news",package="TextForecast")
days=c("2019-30-01","2019-31-01")
z_wrd=get_words(corpus_dates=days,path_name=path_name,ntrms=500,st=0)
hard thresholding
Description
hard thresholding
Usage
hard_thresholding(x, w, y, p_value, newx)
Arguments
x |
the input matrix x. |
w |
the optional input matrix w of variables that cannot be selected (always kept). |
y |
the response variable. |
p_value |
the threshold p-value. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
Value
the variables whose p-values are below the threshold p_value.
Examples
data("stock_data")
data("optimal_factors")
y=as.matrix(stock_data[,2])
y=as.vector(y)
w=as.matrix(stock_data[,3])
pc=as.matrix(optimal_factors)
t=length(y)
news_factor <- hard_thresholding(w=w[1:(t-1),],x=pc[1:(t-1),],y=y[2:t],p_value = 0.01,newx = pc)
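The selection rule can be sketched in a few lines of base R. This is an illustration of hard thresholding under the assumption that each candidate regressor is tested by its individual p-value; the package's exact implementation may differ.

```r
# Hedged sketch: regress y on each candidate column of x (alongside the
# protected regressor w, which is never dropped) and keep the columns
# whose coefficient p-value falls below the threshold.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)         # candidate predictors
w <- rnorm(n)                           # regressor that cannot be selected out
y <- 0.8 * x[, 1] + 0.5 * w + rnorm(n)  # only x[, 1] is truly relevant
p_value <- 0.01
keep <- sapply(seq_len(ncol(x)), function(j) {
  fit <- summary(lm(y ~ x[, j] + w))
  fit$coefficients[2, 4] < p_value      # p-value of x[, j]
})
x_selected <- x[, keep, drop = FALSE]   # the variables below the p-value threshold
```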
News Data
Description
A simple tibble
containing the term counts of financial news from The Wall Street Journal and The New York Times from 1992:01 through 2018:11.
Usage
news_data
Format
A tibble with 1631 components.
- dates
The vector of dates.
- X
The term counts.
optimal alphas function
Description
optimal alphas function
Usage
optimal_alphas(x, w, y, grid_alphas, cont_folds, family)
Arguments
x |
A matrix of variables to be selected by shrinkage methods. |
w |
A matrix or vector of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
grid_alphas |
a grid of alphas between 0 and 1. |
cont_folds |
Set TRUE for contiguous folds, used for time-dependent data. |
family |
The glmnet family. |
Value
lambdas_opt: a vector with the optimal alpha and lambda.
Examples
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[1:200,2])
w=as.matrix(stock_data[1:200,3])
data("news_data")
X=news_data[1:200,2:ncol(news_data)]
x=as.matrix(X)
grid_alphas=seq(by=0.25,to=1,from=0.5)
cont_folds=TRUE
t=length(y)
optimal_alphas=optimal_alphas(x[1:(t-1),],
w[1:(t-1),],y[2:t],grid_alphas,TRUE,"gaussian")
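Conceptually, the grid search over the elastic-net mixing parameter can be sketched with cv.glmnet. This is an assumption about the approach, not the package's exact internals.

```r
# For each alpha on the grid, cross-validate lambda and keep the
# (alpha, lambda) pair with the lowest cross-validation error.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- x[, 1] - 0.5 * x[, 2] + rnorm(100)
grid_alphas <- seq(from = 0.1, to = 1, by = 0.3)
fits <- lapply(grid_alphas, function(a) cv.glmnet(x, y, alpha = a, family = "gaussian"))
errs <- sapply(fits, function(cv) min(cv$cvm))
best <- fits[[which.min(errs)]]
c(alpha = grid_alphas[which.min(errs)], lambda = best$lambda.min)
```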
Optimal Factors
Description
A simple vector containing the optimal factors selected by the optimal_number_factors function.
Usage
optimal_factors
Format
A vector with 1 component.
- optimal factors x
The vector of factors.
optimal number of factors function
Description
optimal number of factors function
Usage
optimal_number_factors(x, kmax)
Arguments
x |
a matrix x. |
kmax |
the maximum number of factors. |
Value
a list with the optimal factors.
Examples
data("optimal_x")
optimal_factor <- optimal_number_factors(x=optimal_x,kmax=8)
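The idea of picking the factor count can be sketched with an information criterion in the spirit of Bai and Ng (2002). This is a hedged illustration; the package may use a different criterion, and select_k is a hypothetical helper.

```r
# Extract principal components, then choose the k that minimizes a
# penalized measure of the residual variance V(k).
select_k <- function(x, kmax) {
  x <- scale(x)
  T <- nrow(x); N <- ncol(x)
  pc <- prcomp(x)
  ic <- sapply(seq_len(kmax), function(k) {
    fhat <- pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
    v <- mean((x - fhat)^2)                        # residual variance V(k)
    log(v) + k * ((N + T) / (N * T)) * log(min(N, T))
  })
  which.min(ic)
}
```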
Optimal x
Description
A simple matrix
containing the optimal words selected by Elastic Net from 1992:01 through 2018:11.
Usage
optimal_x
Format
A matrix with the most predictive terms.
- x
The matrix with 4 components.
Stock Data
Description
A simple tibble
containing the S&P 500 return and the VIX volatility index from 1992:01 through 2018:11.
Usage
stock_data
Format
A tibble with 3 components.
- dates
The vector of dates.
- sp_return
The S&P 500 returns.
- vix
The volatility index.
Text Forecast function
Description
Text Forecast function
Usage
text_forecast(x, y, h, intercept)
Arguments
x |
the input matrix x. |
y |
the response variable |
h |
the forecast horizon |
intercept |
TRUE to include an intercept in the forecast equation. |
Value
The h-step-ahead forecast.
Examples
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[,2])
w=as.matrix(stock_data[,3])
data("news_data")
data("optimal_factors")
pc=optimal_factors
z=cbind(w,pc)
fcsts=text_forecast(z,y,1,TRUE)
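The forecasting step can be sketched as a direct h-step-ahead OLS regression: regress y shifted h periods ahead on x, then predict from the last available row of x. This is an illustration of the idea; text_forecast's internals may differ.

```r
# Direct h-step-ahead forecast by OLS (hedged sketch).
set.seed(1)
T <- 100; h <- 1
x <- matrix(rnorm(T * 2), T, 2)
y <- 0.6 * x[, 1] + rnorm(T)
fit <- lm(y[(1 + h):T] ~ x[1:(T - h), ])  # y_{t+h} on x_t
fcst <- sum(coef(fit) * c(1, x[T, ]))     # forecast of y at T + h
```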
text nowcast
Description
text nowcast
Usage
text_nowcast(x, y, intercept)
Arguments
x |
the input matrix x. It should have one more observation than y. |
y |
the response variable |
intercept |
TRUE to include an intercept in the forecast equation. |
Value
the nowcast (h=0) for the variable y.
Examples
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[,2])
w=as.matrix(stock_data[,3])
data("news_data")
data("optimal_factors")
pc=optimal_factors
z=cbind(w,pc)
t=length(y)
ncsts=text_nowcast(z,y[1:(t-1)],TRUE)
tf-idf function
Description
tf-idf function
Usage
tf_idf(x)
Arguments
x |
an input matrix x of term counts. |
Value
a list with the tf-idf of the terms and the tf-idf of the terms sorted in descending order.
Examples
data("news_data")
X=as.matrix(news_data[,2:ncol(news_data)])
tf_idf_terms = tf_idf(X)
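The weighting itself can be sketched in base R, assuming the usual tf-idf definition (the package's tf_idf may scale terms differently).

```r
# Standard tf-idf from a document-by-term count matrix (hedged sketch).
counts <- matrix(c(3, 0, 1,
                   0, 2, 2), nrow = 2, byrow = TRUE)   # documents x terms
tf  <- counts / rowSums(counts)                        # term frequency per document
idf <- log(nrow(counts) / colSums(counts > 0))         # inverse document frequency
tf_idf_mat <- sweep(tf, 2, idf, `*`)                   # tf-idf weights
```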
Top Terms Function
Description
Top Terms Function
Usage
top_terms(
x,
w,
y,
alpha,
lambda,
k,
wordcloud,
max.words,
scale,
rot.per,
family
)
Arguments
x |
the input matrix of terms to be selected. |
w |
optional argument. The input matrix of structured data that is not subject to selection. |
y |
the response variable |
alpha |
the glmnet alpha |
lambda |
the glmnet lambda |
k |
the k top terms |
wordcloud |
set TRUE to plot the wordcloud |
max.words |
the maximum number of words in the wordcloud |
scale |
the wordcloud size. |
rot.per |
the proportion of words plotted with 90-degree rotation in the wordcloud. |
family |
glmnet family |
Value
the top k terms and the corresponding wordcloud.
Examples
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[,2])
w=as.matrix(stock_data[,3])
data("news_data")
X=news_data[,2:ncol(news_data)]
x=as.matrix(X)
grid_alphas=seq(by=0.05,to=0.95,from=0.05)
cont_folds=TRUE
t=length(y)
optimal_alphas=optimal_alphas(x[1:(t-1),],w[1:(t-1),],
y[2:t],grid_alphas,TRUE,"gaussian")
top_trms<- top_terms(x[1:(t-1),],w[1:(t-1),],y[2:t],
optimal_alphas[[1]], optimal_alphas[[2]],10,TRUE,
10,c(2,0.3),.15,"gaussian")
tv dictionary function
Description
tv dictionary function
Usage
tv_dictionary(x, w, y, alpha, lambda, newx, family)
Arguments
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument. A matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
Value
X_star: a list with the coefficients and a sparse matrix with the most predictive terms.
Examples
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[1:200,2])
w=as.matrix(stock_data[1:200,3])
data("news_data")
X=news_data[1:200,2:ncol(news_data)]
x=as.matrix(X)
grid_alphas=seq(by=0.5,to=1,from=0.5)
cont_folds=TRUE
t=length(y)
optimal_alphas=optimal_alphas(x[1:(t-1),],w[1:(t-1),],
y[2:t],grid_alphas,TRUE,"gaussian")
x_star=tv_dictionary(x=x[1:(t-1),],w=w[1:(t-1),],y=y[2:t],
alpha=optimal_alphas[1],lambda=optimal_alphas[2],newx=x,family="gaussian")
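The selection step can be sketched with a plain glmnet fit: at a given (alpha, lambda), the nonzero coefficients define the dictionary. This is a hedged illustration, not necessarily the package's exact logic.

```r
# Keep the columns of x whose elastic-net coefficients are nonzero.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - x[, 3] + rnorm(100)
fit <- glmnet(x, y, alpha = 0.5, lambda = 0.1, family = "gaussian")
betas <- as.vector(coef(fit))[-1]         # drop the intercept
x_star <- x[, betas != 0, drop = FALSE]   # the selected (most predictive) terms
```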
tv sentiment index function
Description
tv sentiment index function
Usage
tv_sentiment_index(x, w, y, alpha, lambda, newx, family, k)
Arguments
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument. A matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
k |
the highest positive and negative coefficients to be used. |
Value
The time-varying sentiment index. The index is based on word/term counts and is computed as: tv_index = (pos - neg)/(pos + neg).
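The formula can be illustrated directly with hypothetical per-period counts of positive- and negative-coefficient terms (these numbers are made up for illustration):

```r
# Hypothetical counts of positive and negative terms in each period.
pos <- c(10, 4, 7)
neg <- c(2, 6, 7)
tv_index <- (pos - neg) / (pos + neg)
tv_index  # approximately 0.667 -0.200 0.000
```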
Examples
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[,2])
w=as.matrix(stock_data[,3])
data("news_data")
X=news_data[,2:ncol(news_data)]
x=as.matrix(X)
grid_alphas=0.05
cont_folds=TRUE
t=length(y)
optimal_alphas=optimal_alphas(x[1:(t-1),],w[1:(t-1),],
y[2:t],grid_alphas,TRUE,"gaussian")
tv_index <- tv_sentiment_index(x[1:(t-1),],w[1:(t-1),],y[2:t],
optimal_alphas[[1]],optimal_alphas[[2]],x,"gaussian",2)
TV sentiment index using all positive and negative coefficients.
Description
TV sentiment index using all positive and negative coefficients.
Usage
tv_sentiment_index_all_coefs(
x,
w,
y,
alpha,
lambda,
newx,
family,
scaled,
k_mov_avg,
type_mov_avg
)
Arguments
x |
A matrix of variables to be selected by shrinkage methods. |
w |
Optional argument. A matrix of variables that cannot be selected (no shrinkage). |
y |
the response variable. |
alpha |
the alpha required in glmnet. |
lambda |
the lambda required in glmnet. |
newx |
the matrix to which the selection will be applied. Useful for time series, when we need the observation at time t. |
family |
the glmnet family. |
scaled |
Set TRUE to scale the index and FALSE otherwise. |
k_mov_avg |
The moving average order. |
type_mov_avg |
The type of moving average. See movavg. |
Value
A list with the net, positive, and negative sentiment indexes. The net time-varying sentiment index is based on word/term counts and is computed as: tv_index = (pos - neg)/(pos + neg). The positive sentiment index is computed as tv_index_pos = pos/(pos + neg) and the negative as tv_index_neg = neg/(pos + neg).
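The three formulas can be illustrated with hypothetical per-period counts (made-up numbers for illustration only):

```r
# Hypothetical counts of positive and negative terms per period.
pos <- c(10, 4)
neg <- c(2, 6)
tv_index     <- (pos - neg) / (pos + neg)  # net index
tv_index_pos <- pos / (pos + neg)          # positive share
tv_index_neg <- neg / (pos + neg)          # negative share
```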
Examples
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
data("stock_data")
data("news_data")
y=as.matrix(stock_data[,2])
w=as.matrix(stock_data[,3])
data("news_data")
X=news_data[,2:ncol(news_data)]
x=as.matrix(X)
grid_alphas=0.05
cont_folds=TRUE
t=length(y)
optimal_alphas=optimal_alphas(x=x[1:(t-1),],
y=y[2:t],grid_alphas=grid_alphas,cont_folds=TRUE,family="gaussian")
tv_idx=tv_sentiment_index_all_coefs(x=x[1:(t-1),],y=y[2:t],alpha = optimal_alphas[1],
lambda = optimal_alphas[2],newx=x,
scaled = TRUE,k_mov_avg = 4,type_mov_avg = "s")