%\VignetteIndexEntry{An R Package for Optimal Classification} %\VignetteDepends{pscl} %\VignetteKeywords{multivariate} %\VignettePackage{oc} \documentclass[12pt]{article} \usepackage{Sweave} \usepackage{amsmath} \usepackage{amscd} \begin{document} \title{Using Poole's Optimal Classification in R} \maketitle \section{Introduction} This package estimates Poole's Optimal Classification scores from roll call votes supplied though a \verb@rollcall@ object from package \verb@pscl@.\footnote{Production of this package is supported by NSF Grant SES-0611974.} Optimal Classification fits a Euclidean spatial model that places legislators in a specified number of dimensions (usually one or two). It maximizes the correct classification of legislative choices, whereas W-NOMINATE maximizes the probabilities of legislative choices given its error framework. It also differs from W-NOMINATE because it is a non-parametric procedure that requires no assumptions about the parametric form of the legislators' preference functions, other than assuming that they are symmetric and single--peaked. However, legislator coordinates recovered using OC are virtually identical to those recovered by parametric procedures. The R version of Optimal Classification improves upon the earlier software in three ways. First, it is now considerably easier to input new data for estimation, as the current software no longer relies exclusively on the old \emph{ORD} file format for data input. Secondly, roll call data can now be formatted and subsetted more easily using R's data manipulation capabilities. Finally, the \verb@oc@ package includes a full suite of graphics functions to analyze the results. This section briefly outlines the method by which OC scores are calculated. For a full description, readers are referred to chapters 1 through 3 of Keith Poole's \emph{Spatial Models of Parliamentary Voting.} and Poole's article in \emph{Political Analysis} \cite{Poole1} \cite{Poole2}. We begin this discussion by considering how OC works in one dimension. The one-dimensional optimal classification method can be summarized as follows: \begin{enumerate} \item Generate a starting estimate of the legislator rank ordering using singular value decomposition. \item Holding the legislator rank ordering fixed, use the \emph{Janice} algorithm (described below) to find the optimal cutting point ordering. \item Holding the cutting point ordering fixed, use the \emph{Janice} algorithm to find the optimal legislator ordering. \item Return to step 2 to iterate if the results change from the previous iteration \end{enumerate} Together, steps 2-4 constitute what is know as the \emph{Edith} algorithm. To understand the Janice algorithm, we provide a useful example. Suppose we have six legislators who vote on 5 roll calls as follows:\\ \begin{center} \begin{tabular}{|c|c|c|c|c|c|} \hline Legislators&$1$&$2$&$3$&$4$&5\\ \hline One & Y & Y & N & Y & Y\\ Two & N & Y & Y & Y & Y\\ Three & N & N & Y & Y & Y\\ Four & N & N & N & Y & Y\\ Five & N & N & N & N & Y\\ Six & N & N & N & N & N\\ \hline \end{tabular} \end{center} To recover the legislator ideal points $X_{1}$... $X_{6}$ with voting error, we generate an agreement score matrix and extract the first eigenvector from the double--centered agreement score matrix (not shown) as follows. The result here transposes the ideal points of legislators $X_{1}$ and $X_{2}$, demonstrating how \emph{Janice} orders the legislator ideal points.\\ \begin{center} \begin{tabular}{|c|c|c|c|c|c|} \hline %$Legislators$&Agreement Scores\\ \multicolumn{6}{|c|}{Agreement Scores}\\ \hline 1 & & & & &\\ .6 & 1 & & & &\\ .4 & .8 & 1 & & &\\ .6 & .6 & .8 & 1 & &\\ .4 & .4 & .6 & .8 & 1.0 &\\ .2 & .2 & .4 & .6 & .8 & 1.0\\ \hline \end{tabular}\\ \begin{tabular}{|c|} \hline First Eigenvector\\ \hline $X_{1} = -.44512$\\ $X_{2} = -.48973$\\ $X_{3} = -.11093$\\ $X_{4} = .04431$\\ $X_{5} = .34859$\\ $X_{6} = .65288$\\ \hline \end{tabular} \end{center} To see how \emph{Janice} chooses the cutting lines to minimize the number of classification errors, we first observe that the ideal points of the legislators are ordered such that $X_{2} < X_{1} < X_{3} < X_{4} < X_{5} < X_{6}$. From this ordering, we take the predicted votes conditional on the \emph{j}th cutline $z_{j}$ as follows, and simply measure the actual number of errors on the roll calls. In this example, we show only a table with Yeas on the Left and Nays on the Right; however, this procedure is usually also repeated with a table where Yeas are on the Right and Nays are on the Left. \begin{center} \begin{tabular}{|c|c|c|} \hline Cutline placement & Predicted Vote & Errors on Roll calls \\ &&1 2 3 4 5\\ \hline $Z_{j} < X_{2}$ & N N N N N N & 1 2 2 4 5\\ $X_{2} < Z_{j} < X_{1}$ & Y N N N N N & 2 1 1 3 4\\ $X_{1} < Z_{j} < X_{3}$ & Y Y N N N N & \textbf{\underline{1}} \textbf{\underline{0}} 2 2 3\\ $X_{3} < Z_{j} < X_{4}$ & Y Y Y N N N & 2 1 \textbf{\underline{1}} 1 2\\ $X_{4} < Z_{j} < X_{5}$ & Y Y Y Y N N & 3 2 2 \textbf{\underline{0}} 1\\ $X_{5} < Z_{j} < X_{6}$ & Y Y Y Y Y N & 4 3 3 1 \textbf{\underline{0}}\\ $X_{6} < Z_{j}$ & Y Y Y Y Y Y & 5 4 4 2 1\\ \hline \end{tabular} \end{center} The underlined placements of the five roll call cutlines thus minimize the number of classification errors in this example, and the application of the \emph{Janice} algorithm produces the following joint ordering of legislators and cutting points: \begin{center} $X_{1} < X_{2} < Z_{1} = Z_{2} < X_{3} < Z_{3} < X_{4} < Z_{4} < X_{5} < Z_{5} < X_{6}$ \end{center} In multiple dimensions, OC shares the same core algorithm as its single-dimension counterpart. The major difference in multiple dimensions is that cutting lines can no longer be tested exhaustively as in the one-dimensional case. Instead, the optimal cutting lines $N_{j}$ for a roll call matrix with $p$ legislators, $s$ roll calls, and $d$ dimensions are derived through projection onto a least squares line as follows: \begin{enumerate} \item Obtain a starting estimate of $N_{j}$, usually through least squares regression. \item Calculate the correct classifications associated with $N_{j}$. \item Construct $\Psi^{*}$, where: \begin{itemize} \item $\Psi_{i} = X_{i} + (c_{j} - w_{i})N_{j}$ if correctly classified and $\Psi_{i} = X_{i}$ if incorrectly classified. $\Psi_{i}$ is the $s x 1$ vector that is the $i$th row of $\Psi$, $X_{i}$ is the ideal point of legislator i, $c_{j}$ is the midpoint of roll call $j$, and $w_{i} = X_{i}'N_{j}$. \item $\Psi^{*}$ = $\Psi$ - $J_{p}u'$, where $J_{p}$ is a $p x 1$ vector of 1s and $u$ is a $d x 1$ vector of means. \end{itemize} \item Perform the singular value decomposition $U\Lambda V'$ of $\Psi^{*}$. \item Use the \emph{s}th singular vector (\emph{s}th column) $v_{s}$ of $V$ as the new estimate of $N_{j}$. \end{enumerate} The starting value generator, cutting line algorithm, and legislator ordering algorithm collectively consitute the OC algorithm that is implemented in this R package. \section{Usage Overview} The \verb@oc@ package was designed for use in one of three ways. First, users can estimate ideal points from a set of Congressional roll call votes stored in the traditional \emph{ORD} file format. Secondly, users can generate a vote matrix of their own, and feed it directly into \verb@oc@ for analysis. Finally, users can also generate test data with ideal points and bill parameters arbitrarily specified as arguments by the user for analysis with \verb@oc@. Each of these cases are supported by a similar sequence of function calls, as shown in the diagrams below: \begin{flushleft} $ \begin{CD} \texttt{\emph{ORD} file} @>\texttt{readKH()}>> \texttt{\emph{rollcall} object} @>\texttt{oc()}>> \texttt{\emph{OCobject}} \end{CD} $ $ \begin{CD} \texttt{Vote matrix} @>\texttt{rollcall()}>> \texttt{\emph{rollcall} object} @>\texttt{oc()}>> \texttt{\emph{OCobject}} \end{CD} $ \end{flushleft} Following generation of an \verb@OCobject@, the user then analyzes the results using the \emph{plot} and \emph{summary} methods, including: \begin{itemize} \item \textbf{plot.OCcoords()}: Plots ideal points in one or two dimensions. \item \textbf{plot.OCangles()}: Plots a histogram of cut lines. \item \textbf{plot.OCcutlines()}: Plots a specified percentage of cutlines (a Coombs mesh). \item \textbf{plot.OCskree()}: Plots a Skree plot with the first 20 eigenvalues. \item \textbf{plot.OCobject()}: S3 method for an \verb@OCobject@ that combines the four plots described above. \item \textbf{summary.OCobject()}: S3 method for an \verb@OCobject@ that summarizes the estimates. \end{itemize} Examples of each of the three cases described here are presented in the following sections. \section{Optimal Classification with ORD files} This is the use case that the majority of \verb@oc@ users are likely to fall into. Roll call votes in a fixed width format \emph{ORD} format for all U.S. Congresses are stored online for download at: \begin{itemize} \item \emph{https://legacy.voteview.com/} \item \emph{https://voteview.com} (updates votes in real time) \end{itemize} \verb@oc@ takes \verb@rollcall@ objects from Simon Jackman's \verb@pscl@ package as input. The package includes a function, \emph{readKH()}, that takes an \emph{ORD} file and automatically transforms it into a \verb@rollcall@ object as desired. Refer to the documentation in \verb@pscl@ for more detailed information on \emph{readKH()} and \emph{rollcall()}. Using the 90th Senate as an example, we can download the file \emph{sen90kh.ord} and read the data in R as follows:\\ <>= library(oc) #sen90 <- readKH("https://voteview.com/static/data/out/votes/S090_votes.ord") data(sen90) #Does same thing as above sen90 @ To make this example more interesting, suppose we were interested in applying \emph{oc()} only to bills that pertained in some way to agriculture. Keith Poole and Howard Rosenthal's VOTEVIEW software allows us to quickly determine which bills in the 90th Senate pertain to agriculture.\footnote{VOTEVIEW for Windows can be downloaded at \emph{legacy.voteview.com}.} Using this information, we create a vector of roll calls that we wish to select, then select for them in the \verb@rollcall@ object. In doing so, we should also take care to update the variable in the \verb@rollcall@ object that counts the total number of bills, as follows: <>= selector <- c(21,22,44,45,46,47,48,49,50,53,54,55,56,58,59,60,61,62,65,66,67,68,69,70,71,72,73,74,75,77,78,80,81,82,83,84,87,99,100,101,105,118,119,120,128,129,130,131,132,133,134,135,141,142,143,144,145,147,149,151,204,209,211,218,219,220,221,222,223,224,225,226,227,228,229,237,238,239,252,253,257,260,261,265,266,268,269,270,276,281,290,292,293,294,295,296,302,309,319,321,322,323,324,325,327,330,331,332,333,335,336,337,339,340,346,347,357,359,367,375,377,378,379,381,384,386,392,393,394,405,406,410,418,427,437,442,443,444,448,449,450,454,455,456,459,460,461,464,465,467,481,487,489,490,491,492,493,495,497,501,502,503,504,505,506,507,514,515,522,523,529,539,540,541,542,543,544,546,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,565,566,567,568,569,571,584,585,586,589,590,592,593,594,595) sen90$m <- length(selector) sen90$votes <- sen90$votes[,selector] @ \emph{oc()} takes a number of arguments described fully in the documentation. Most of the arguments can (and probably should) be left at their defaults, particularly when estimating ideal points from U.S. Congresses. The default options estimate ideal points in two dimensions without standard errors, using the same beta and weight parameters as described in the introduction. Votes where the losing side has less than 2.5 per cent of the vote, and legislators who vote less than 20 times are excluded from analysis. The most important argument that \emph{oc()} requires is a set of legislators who have positive ideal points in each dimension. This is the \emph{polarity} argument to \emph{oc()}. In two dimensions, this might mean a fiscally conservative legislator on the first dimension, and a socially conservative legislator on the second dimension. Polarity can be set in a number of ways, such as a vector of row indices (the recommended method), a vector of names, or by any arbitrary column in the \emph{legis.data} element of the \verb@rollcall@ object. Here, we use Senators Sparkman and Bartlett to set the polarity for the estimation. The names of the first 12 legislators are shown, and we can see that Sparkman and Bartlett are the second and fifth legislators respectively. <>= rownames(sen90$votes)[1:12] result <- oc(sen90, polarity=c(2,5)) @ \verb@result@ now contains all of the information from the OC estimation, the details of which are fully described in the documentation for \emph{oc()}. \verb@result$legislators@ contains all of the information from the \verb@PERF25.DAT@ file from the old Fortran \emph{oc()}, while \verb@result$rollcalls@ contains all of the information from the old \verb@PERF21.DAT@ file. The information can be browsed using the \emph{fix()} command as follows (not run):\\ \noindent > \emph{legisdata <- result\$legislators}\\ > \emph{fix(legisdata)}\\ For those interested in just the ideal points, a much better way to do this is to use the \emph{summary()} function: <>= summary(result) @ \verb@result@ can also be plotted, with a basic summary plot achieved as follows as shown Figure 1: \begin{figure} \begin{center} <>= plot(result) @ \end{center} \caption{Summary Plot of 90th Senate Agriculture Bill OC Scores} \label{fig:one} \end{figure} This basic plot splits the window into 4 parts and calls \emph{plot.OCcoords()}, \emph{plot.OCangles()}, \emph{plot.OCskree()}, and \emph{plot.OCcutlines()} sequentially. Each of these four functions can be called individually. In this example, the coordinate plot on the top left plots each legislator with their party affiliation. A unit circle is included to illustrate how OC scores are constrained to lie within a unit circle. Observe that with agriculture votes, party affiliation does not appear to be a strong predictor on the first dimension, although the second dimension is largely divided by party line. The skree plot shows the first 20 eigenvalues, and the rapid decline after the second eigenvalue suggests that a two-dimensional model describes the voting behavior of the 90th Senate well. The final plot shows 50 random cutlines, and can be modified to show any desired number of cutlines as necessary. Three things should be noted about the use of the \emph{plot()} functions. First, the functions always plot the results from the first two dimensions, but the dimensions used (as well as titles and subheadings) can all be changed by the user if, for example, they wish to plot dimensions 2 and 3 instead. Secondly, plots of one dimensional \verb@oc@ objects work somewhat differently than in two dimensions and are covered in the example in the final section. Finally, \emph{plot.OCcoords()} can be modified to include cutlines from whichever votes the user desires. The cutline of the 14th agricultural vote (corresponding to the 58th actual vote) from the 90th Senate with ideal points is plotted below in Figure 2, showing that the vote largely broke down along partisan lines. \begin{figure} \begin{center} <>= par(mfrow=c(1,1)) plot.OCcoords(result,cutline=14) @ \end{center} \caption{90th Senate Agriculture Bill OC Scores with Cutline} \label{fig:two} \end{figure} \section{Optimal Classification with arbitrary vote matrix} This section describes an example of OC being used for roll call data not already in \emph{ORD} format. The example here is drawn from the first three sessions of the United Nations, discussed further as Figure 5.8 in Keith Poole's \emph{Spatial Models of Parliamentary Voting} \cite{Poole1}. To create a \verb@rollcall@ object for use with \emph{oc()}, one ideally should have three things: \begin{itemize} \item A matrix of votes from some source. The matrix should be arranged as a \textit{legislators} x \textit{votes} matrix. It need not be in 1/6/9 or 1/0/NA format, but users must be able to distinguish between Yea, Nay, and missing votes. \item A vector of names for each member in the vote matrix. \item OPTIONAL: A vector describing the party or party-like memberships for the legislator. \end{itemize} The \verb@oc@ package includes all three of these items for the United Nations, which can be loaded and browsed with the code shown below. The data comes from Eric Voeten at George Washington University. In practice, one would prepare a roll call data set in a spreadsheet, like the one available one \emph{legacy.voteview.com/k7ftp/dtaord/UN.csv}, and read it into R using \emph{read.csv()}. The csv file is also stored in this package and can be read using:\\ \noindent \emph{UN<-read.csv(``library/oc/data/UN.csv'',header=FALSE,strip.white=TRUE) }\\ The line above reads the exact same data as what is stored in this package as R data, which can be obtained using the following commands: <>= rm(list=ls(all=TRUE)) data(UN) UN<-as.matrix(UN) UN[1:5,1:6] @ Observe that the first column are the names of the legislators (in this case, countries), and the second column lists whether a country is a ``Warsaw Pact'' country or ``Other'', which in this case can be thought of as a `party' variable. All other observations are votes. Our objective here is to use this data to create a \verb@rollcall@ object through the \emph{rollcall} function in \verb@pscl@. The object can then be used with \emph{oc()} and its plot/summary functions as in the previous \emph{ORD} example. To do this, we want to extract a vector of names (\emph{UNnames}) and party memberships (\emph{party}), then delete them from the original matrix so we have a matrix of nothing but votes. The \emph{party} variable must be rolled into a matrix as well for inclusion in the \verb@rollcall@ object as follows: <>= UNnames<-UN[,1] legData<-matrix(UN[,2],length(UN[,2]),1) colnames(legData)<-"party" UN<-UN[,-c(1,2)] @ In this particular vote matrix, Yeas are numbered 1, 2, and 3, Nays are 4, 5, and 6, abstentions are 7, 8, and 9, and 0s are missing. Other vote matrices are likely different so the call to \emph{rollcall} will be slightly different depending on how votes are coded. Party identification is included in the function call through \verb@legData@, and a \verb@rollcall@ object is generated and applied to OC as follows. A one dimensional OC model is fitted, and result is summarized below and plotted in Figure 3: <>= rc <- rollcall(UN, yea=c(1,2,3), nay=c(4,5,6), missing=c(7,8,9),notInLegis=0, legis.names=UNnames, legis.data=legData, desc="UN Votes", source="legacy.voteview.com") result<-oc(rc,polarity=1,dims=1) @ <>= summary(result) @ \begin{figure} \begin{center} <>= plot(result) @ \end{center} \caption{Summary Plot of UN Data} \label{fig:three} \end{figure} Note that the one dimensional plot differs considerably from the previous two dimensional plots, since only a coordinate plot and a Skree plot are shown. This is because in one dimension, all cutlines are angled at 90$^\circ$, so there is no need to plot either the cutlines or a histogram of cutline angles. Also, the plot appears to be compressed, so users need to expand the image manually by using their mouse and dragging along the corner of the plot to expand it. \newpage \begin{thebibliography}{2} \bibitem{Poole1} Poole, Keith (2000) ``Non-Parametric Unfolding of Binary Choice Data.'' \emph{Political Analysis} 8: 211-237. \bibitem{Poole2} Poole, Keith (2005) \emph{Spatial Models of Parliamentary Voting.} Cambridge: Cambridge University Press. \end{thebibliography} \end{document}