[[R Statistics]]

## Install (2021)

1. R from CRAN
2. RStudio
3. Radian (an IPython-like console)

## History

- S - developed by John Chambers at Bell Labs in 1976, using Fortran
- Early versions did not have statistical modeling
- 1988 - rewritten in C (version 3 of S)
- White Book - *Statistical Models in S* (version 3 of S)
- Green Book - *Programming with Data* (version 4 of S), 1998 - currently used
- The philosophy of S is to be an interactive language: "Slide gradually into programming"
- R is a free variant of S, created by Ross Ihaka & Robert Gentleman
- Base + packages (modular)

## Basics of R

### General Tips

- expression - anything typed at the console
- `#` starts a comment
- `<-` is the assignment operator
- `:` operator creates a sequence, e.g. `1:4`
- `?` before a function name opens its help page
- `c()` - concatenates objects to form a vector
- `ls()` - lists the objects in the current workspace
- `str()` - a powerful command for getting info about any object
- Coercion - type casting
  - character > numeric > logical, i.e. mixed values are coerced to the more generic type
  - `as.numeric(x)`
  - `as.logical(x)`
  - `as.character(x)`
  - `as.complex(x)`
  - Characters that cannot be interpreted give `NA`, e.g. `as.logical("a")`
- Object (everything in R)
  - Objects have "attributes"
  - All objects have a class
  - `class()` tells you what an object is, like `type()` in Python
  - The basic object is a vector; it cannot have mixed types.
    All items have to be of the same class.
  - `vector()` creates vectors
  - List - can contain different types
    - `x <- list(1, "a", TRUE)`
    - `[[]]` - when printed, elements are marked with double brackets

Atomic classes

- R has 5 atomic classes
  - Character (strings)
  - Numeric (real numbers, 64-bit)
  - Integer
  - Complex, e.g. `2+4i`
  - Logical (`TRUE`/`FALSE`)
- Integer
  - The suffix `L` gives an integer (`1L`); otherwise R stores the number as double precision
  - `2` is numeric, `2L` is integer
  - Integers are numerics, but not vice versa
- Numeric (real/float)
  - Double precision
  - `Inf` - infinity
  - `NaN` - not a number

Attributes

- Objects have attributes: names, dimnames, dimensions, class, length

Matrix (it's a type of vector)

- Can also hold only one class
- `m <- matrix(1:6, nrow=2, ncol=3)`
  - `matrix()` fills column by column unless `byrow=TRUE`
- `rbind()` - binds rows
- `cbind()` - binds columns
- "Recycling" happens if the sizes do not match

Factor

- Categorical data
- Factors can be unordered (nominal) or ordered (ordinal)
- `factor(c("yes", "yes", "no"), levels=c("yes", "no"))` creates a factor
- Greater-than and less-than comparisons work only with ordered factors
- The `levels=c()` argument of `factor()` sets the order of the levels
- Can be thought of as integer values with "labels"
- The first level is the "baseline level"; by default the levels are sorted alphabetically
- `table()` - gives a frequency table
- `unclass()` - strips the class and shows the underlying integers

Missing Values

- `NA` - missing value
  - Can have classes: there are integer `NA`s, character `NA`s, etc.
- `NaN` - result of an undefined mathematical operation, e.g. `0/0`
  - `NaN` is also `NA`, but not vice versa
- `good_df <- df[complete.cases(df), ]` keeps only the rows without missing values (`complete.cases()` itself returns a logical vector)

Data frame

- Key data type for tabular data
- Special type of list in which every element has the same length
  - Each element of the list is a column, and the length of the element is the number of rows
- Can store columns of different classes
- `row.names` attribute - each row can have a name
- Can be imported with `read.table()` or `read.csv()`
- Coerced to a matrix by calling `data.matrix()`
- `x <- data.frame(foo=1:4, bar=c(T,T,F,F))`

Reading stuff

- `read.table` and `read.csv` (kinda like `pd.read_csv`)
- `source` - like `import` in Python (inverse of `dump`)
- `readLines` - reads the lines of a text document
- `dget` - also for R source code (inverse of `dput`)
- `load` - binary (saved workspaces)
- Other connection interfaces
  - `file`
  - `gzfile`
  - `bzfile`
  - `url`

Reading large data (tips)

- Set `comment.char = ""` if there are no commented lines
- Specify `colClasses`
- Use `nrows` to read only some rows

`dump`ing & `dput`ing

- Textual output that is basically R code
- Metadata is not lost with either

Subsetting (aka indexing in Python)

- STARTS WITH 1 not 0 !!!!
- `[` - some differences: `a[,2]` selects the whole second column, like the slice `a[:,1]` in NumPy
  - The index can be negative (drops those elements) or logical, e.g. `x[!is.na(x)]`
- `[[` - extracts a single element, e.g. one column of a data frame
- `$` - extracts by name and partially matches the name of the column (it's awesome)
  - `x$a` can give the same answer as `x[["aardvark"]]`

Vectorization and Broadcasting

- Works like Python (NumPy). Broadcasting is called *"recycling"*

Control Structures

- `if(condition){}else{}`
  - `y <- if(condition){value}else{value}` assigns conditionally
- `for`
  - `for(i in 1:4){}`
  - `for(i in seq_along(vector)){}`
  - `for(i in seq_len(nrow(x))){}`
  - `for(element in vector){}`
- `while(i < 10){}` - similar to C
- `repeat{}` - infinite loop; use `break` to exit
- `next` - like `continue` in Python/C

`with`
https://stackoverflow.com/questions/42283479/when-to-use-with-function-and-why-is-it-good

## Functions

```R
add <- function(x, y) {
  x + y
}
```

- The last value is returned, like Scheme; an explicit `return()` also works
- Can be passed to other functions
- Arguments can have default values, like Python
- Positional arguments may come after named arguments, unlike Python. It's confusing as hell, e.g. `lm(data = df, y ~ x, model = FALSE, 1:100)`
- Argument names can also be partially matched
- Lexical scoping, aka static scoping - like Python and JavaScript
  - "Free variables" are searched for in the environment in which the function was defined
- Lazy evaluation - R won't complain about a missing argument if it is never used within the function (unlike Python)
- `...` is used to accept a variable number of arguments
- Namespace - type `search()` to see the order in which environments are searched
- R has separate namespaces for variables and functions.
  For example, a variable named `c` does not interfere with the function `c()`.

Date and Time

- Dates are represented by the `Date` class (days since 1970-01-01)
  - `as.Date("2021-02-02")`
- Times are represented by the `POSIXct` or `POSIXlt` class
  - `Sys.time()`
  - `as.POSIXlt(Sys.time())$sec` - gives the current second
  - `POSIXct` - stored as a very large number under the hood (seconds since 1970-01-01)
  - `POSIXlt` - stored as a list
- `strptime()` - converts text to `POSIXlt`
  - `x <- strptime(c("Jan 10, 2021 10:40", "Jan 20, 2022 3:23"), "%B %d, %Y %H:%M")`
  - `?strptime` for formats

Loop Functions

- `lapply(list, function, ...)`
  - Anonymous function - an on-the-fly function like a lambda, e.g. `function(var) var[,1]`
  - Returns a `list` (not a vector)
  - `...` passes extra arguments on to the function
- `sapply`
  - Tries to simplify the `lapply` output to a vector or matrix
  - Closer equivalent to `map` in Scheme
- `apply(array, margin, function, ...)`
  - `array` is a matrix or array
  - `margin` is an integer vector indicating which margins should be "retained": 1 gives one result per row (columns are collapsed), 2 gives one result per column (rows are collapsed). It plays the role of `axis` in pandas
  - Kinda like a `reduce`
  - Typically used to apply a function to the rows or columns of a matrix
  - `apply` is not faster than a `for` loop; it just looks concise
  - `rowSums`, `rowMeans`, `colSums`, `colMeans` - a lot faster than `apply` with `sum` or `mean`
- `mapply(function, ..., MoreArgs = NULL, SIMPLIFY = TRUE)`
  - Multivariate apply - applies the function over a bunch of arrays in parallel
  - `...` holds the arrays to iterate over
  - E.g. `mapply(rep, 1:4, 2:5)`
- `tapply(vector, INDEX, function, ..., simplify = TRUE)`
  - `INDEX` is a vector of the same length as `vector`. It is categorical, telling which category each element of the vector belongs to.
  - `function` is applied to each group of the vector
  - `INDEX` is a factor variable that can be generated using `gl()`
- `split` - splits a vector into a list based on a factor (the first step of `tapply`)

Debug commands

- Three levels of information
  - `message` - some info
  - `warning` - something unexpected happened, but execution continues
  - `error` - stops execution of the function
  - `condition` - something that triggers the information
- `traceback()` - call it after an error happened; prints the function call stack
- `debug(function)` - then executes the function line by line
  - `n` for next
- `browser()` - drops into debug mode once encountered in code
- `trace()` - inserts debug code without editing the function

Random Numbers

- `r` - generate random numbers
- `d` - evaluate the probability density
- `p` - evaluate the cumulative distribution
- `q` - quantile function (inverse of `p`)
- Types
  - `norm`
  - `pois`
  - `gamma`
  - `unif`
- Example: `rnorm(100)`
- `set.seed()` - used to generate the same random numbers again
- `sample()` - samples from a vector
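
A minimal sketch of how the `r`/`d`/`p`/`q` prefixes combine with a distribution name, and of what `set.seed()` buys you. The variable names (`a`, `b`, `picked`) and the seed value are arbitrary choices for illustration, not part of the notes above.

```R
# Same seed, same draws: this is what makes a simulation reproducible.
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)        # TRUE

# d / p / q on the standard normal distribution
dnorm(0)               # density at 0, equal to 1 / sqrt(2 * pi)
pnorm(0)               # cumulative probability up to 0, i.e. 0.5
qnorm(pnorm(1.5))      # q inverts p, so this gives back 1.5 (up to rounding)

# The same prefixes work with the other distribution names
rpois(3, lambda = 2)   # three Poisson counts with mean 2
runif(2)               # two uniform draws between 0 and 1

# sample() draws from a vector, without replacement by default
picked <- sample(1:10, 3)
```

Calling `set.seed()` once at the top of a script is usually enough; it is reset here a second time only to show that the draws repeat.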