[[R Statistics]]
## Install
(2021)
1. R from CRAN
2. RStudio
3. radian (an IPython-like console for R)
## History
- S - developed by John Chambers at Bell Labs in 1976, using Fortran
- Early versions did not have statistical modeling.
- 1988 - rewritten in C (version 3 of S)
  - White Book - Statistical Models in S
- 1998 - version 4 of S
  - Green Book - Programming with Data
  - Currently used
- Philosophy of S is to be an interactive language: "Slide gradually into programming"
- R is a free variant of S
  - Created by Ross Ihaka & Robert Gentleman
  - Base + packages (modular)
## Basics of R
### General Tips
- expression - stuff written on console
- `#` comment
- `<-` is assigning
- `:` operator creates sequence
- `?` before a function gets help
- `c()` = concatenate objects to form vectors
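- `ls()` lists the objects in the current environment
- `str()` is a powerful command to get a compact summary of any object
- Coercion - type casting
  - mixed types are coerced to the more generic one: character > numeric > logical
  - `as.numeric(x)`
  - `as.logical(x)`
  - `as.character(x)`
  - `as.complex(x)`
  - Coercion that makes no sense gives `NA`: `as.numeric("a")` is `NA` (with a warning) and `as.logical("a")` is `NA`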
- Object (everything in R is an object)
  - Objects have "attributes"
  - All objects have a class - `class()` tells you what it is, like `type()` in python
- The basic object is a vector
  - cannot have mixed types; all items have to be of the same class
  - `vector()` creates vectors, e.g. `vector("numeric", length = 10)`
- List
  - can contain different types
  - `x <- list(1, "a", TRUE)`
  - when printed, the elements are marked with double brackets `[[ ]]`
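
A minimal sketch of the tips above (assignment, `:`, `c()`, coercion and lists). The variable names are only illustrative.

```R
# Assignment and the sequence operator
x <- 1:5                       # integer vector 1 2 3 4 5

# c() builds a vector; mixed types are coerced to the most generic one
mixed <- c(1.5, "a", TRUE)
class(mixed)                   # "character"

# Explicit coercion
nums <- as.numeric(c("1", "2"))             # 1 2
lgl  <- as.logical(c(0, 1, 2))              # FALSE TRUE TRUE
bad  <- suppressWarnings(as.numeric("abc")) # NA (normally with a warning)

# A list keeps each element's own class
lst <- list(1, "a", TRUE)
class(lst[[2]])                # "character"
class(lst[[3]])                # "logical"

str(lst)                       # compact summary of the structure
```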
Atomic classes
- R has 5 atomic classes
  - Character (strings)
  - Numeric (real numbers, 64-bit double precision)
  - Integer
  - Complex `2+4i`
  - Logical (`TRUE`/`FALSE`)
- Integer
  - Suffix `L` gives an integer, e.g. `1L`
  - otherwise R stores the number as double precision
  - `2` is numeric, `2L` is integer
  - Integers are numerics (`is.numeric(2L)` is `TRUE`), not vice versa
- Numeric (real/float)
  - Double precision
  - `Inf` - infinity, e.g. `1/0`
  - `NaN` - not a number, e.g. `0/0`
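
A quick check of the integer vs. numeric distinction and the special values, as a small sketch (names are illustrative):

```R
a <- 2          # stored as double
b <- 2L         # stored as integer

class(a)        # "numeric"
class(b)        # "integer"
is.numeric(b)   # TRUE: integers count as numeric
is.integer(a)   # FALSE: a plain 2 is not an integer

pos_inf <- 1 / 0      # Inf
not_num <- 0 / 0      # NaN
z <- 2 + 4i
class(z)        # "complex"
```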
Attributes
- objects have attributes
- names, dimnames, dimensions, class, length
Matrix (a vector with a `dim` attribute)
- Can also be of only one class
- `m <- matrix(1:6, nrow=2, ncol=3)`
- `matrix()` fills column by column unless `byrow=TRUE`
- `rbind()` - binds vectors or matrices together as rows
- `cbind()` - binds them together as columns
- "recycling" happens if the sizes do not match
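
A small sketch of column-wise filling, `byrow`, and binding. The matrix names are illustrative.

```R
m     <- matrix(1:6, nrow = 2, ncol = 3)                # filled column by column
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled row by row

m[1, ]       # 1 3 5
m_row[1, ]   # 1 2 3

r  <- rbind(1:3, 4:6)   # each argument becomes a row    -> 2 x 3
cb <- cbind(1:3, 4:6)   # each argument becomes a column -> 3 x 2
dim(r)       # 2 3
dim(cb)      # 3 2

# Recycling: the shorter vector is repeated to fill the column
rec <- cbind(1:4, 1:2)
rec[, 2]     # 1 2 1 2
```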
Factor
- Categorical data
- Factors can be unordered (nominal) or ordered (ordinal)
- `factor(c("yes", "yes", "no"), levels=c("yes", "no"))` creates a factor
- greater than / less than comparisons work only with ordered factors
- Can be thought of as integer values with "labels"
- The first level is the "baseline level"; by default the levels are sorted alphabetically
- The `levels` argument of `factor()` sets the order of the levels (add `ordered=TRUE` for an ordered factor)
- Inspecting a factor
  - `table()` - gives a frequency table
  - `unclass()` - strips the class and shows the underlying integers
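
A sketch of how levels, integer codes and ordering behave. The `sizes` example is made up for illustration.

```R
f <- factor(c("yes", "yes", "no"))
levels(f)         # "no" "yes"  (alphabetical by default)

f2 <- factor(c("yes", "yes", "no"), levels = c("yes", "no"))
levels(f2)        # "yes" "no"  ("yes" is now the baseline level)
table(f2)         # counts: yes 2, no 1
as.integer(f2)    # 1 1 2  (the integer codes behind the labels)
unclass(f2)       # same codes, plus a "levels" attribute

# Comparisons need an ordered factor
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
sizes[1] < sizes[2]   # TRUE: small < large
```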
Missing Values
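- `NA` - missing value
  - Can have classes - there are integer NAs, character NAs etc.
- `NaN` - result of undefined mathematical operations, e.g. `0/0`
  - `NaN` is also `NA` (`is.na(NaN)` is `TRUE`) but not vice versa
- `good_df <- df[complete.cases(df), ]` keeps only the rows without missing values (`complete.cases()` itself returns a logical vector)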
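
A short sketch of testing for missing values and dropping incomplete rows. The data frame `df` is made up for illustration.

```R
x <- c(1, NA, 0 / 0, 4)    # 0/0 produces NaN

is.na(x)      # FALSE TRUE TRUE FALSE   (NaN counts as NA)
is.nan(x)     # FALSE FALSE TRUE FALSE  (NA is not NaN)
clean <- x[!is.na(x)]      # 1 4

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
keep <- complete.cases(df) # logical vector, one entry per row
good_df <- df[keep, ]      # only the fully observed rows
nrow(good_df)              # 1
```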
Data frame
- Key data type - tabular data
- Special type of list - each element of the list has the same length
- Each element of the list is a column and the length of the element is the number of rows
- Can store columns of different classes
- `row.names` attribute - each row can have a name
- can be imported by using `read.table()` or `read.csv()`
- Coerced to a matrix by calling `data.matrix()`
`x<-data.frame(foo=1:4, bar=c(T,T,F,F))`
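
Continuing with the same `x`, a small sketch of inspecting a data frame and coercing it to a matrix. The row names are illustrative.

```R
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

nrow(x)                  # 4
ncol(x)                  # 2
col_classes <- sapply(x, class)   # foo "integer", bar "logical"

row.names(x) <- c("a", "b", "c", "d")
x["b", ]                 # select a row by its name

mat <- data.matrix(x)    # logical column is converted to numbers
dim(mat)                 # 4 2
```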
Reading stuff
- `read.table` and `read.csv` (kinda like pd.read_csv)
- `source` - runs R code from a file, like import in python (inverse of `dump`)
- `readLines` - reads the lines of a text file
- `dget` - also reads R source code (inverse of `dput`)
- `load` - reads saved binary workspaces (inverse of `save`)
- other connection interfaces
  - `file`
  - `gzfile`
  - `bzfile`
  - `url`
Reading large data (tips)
- set `comment.char =""` if there are no commented lines
- specify `colClasses`
- use `nrows` to read only some rows
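
A sketch of the reading tips above. It first writes a tiny CSV to a temporary file so the example is self-contained; the column names and values are made up.

```R
# Create a small file to read back
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     score = c(1.5, 2.5, 3.5, 4.5, 5.5),
                     name = letters[1:5]),
          path, row.names = FALSE)

# Telling R the column classes and disabling comment scanning
# saves work on large files; nrows limits how much is read
df <- read.table(path, header = TRUE, sep = ",",
                 comment.char = "",
                 colClasses = c("integer", "numeric", "character"),
                 nrows = 3)

nrow(df)             # 3
sapply(df, class)    # id "integer", score "numeric", name "character"

# readLines returns the raw text, one string per line
first_line <- readLines(path, n = 1)
```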
`dump`ing & `dput`ing
- Textual output which is basically r-code
- Metadata is not lost with either
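
A round-trip sketch: `dput`/`dget` for a single object and `dump`/`source` for named objects. The temporary files and the environment `e` are only there to keep the example self-contained.

```R
y <- data.frame(a = 1:2, b = c("x", "y"))

# dput writes R code that rebuilds the object; dget reads it back
p1 <- tempfile()
dput(y, file = p1)
y2 <- dget(p1)
class(y2)            # "data.frame" (the class survives the round trip)

# dump writes assignments for named objects; source runs them again
p2 <- tempfile()
dump("y", file = p2)
e <- new.env()
source(p2, local = e)    # recreates y inside the environment e
exists("y", envir = e)   # TRUE
```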
Subsetting (aka indexing in python)
- STARTS WITH 1 not 0 !!!!
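- `[`
  - returns an object of the same class as the original
  - an empty index means "everything": `a[,2]` is the second column, i.e. `a[:, 1]` in NumPy
  - the index can be negative (drops those elements) or logical, e.g. `x[!is.na(x)]`
- `[[`
  - extracts a single element of a list; for a data frame it gets one column
- `$`
  - extracts by name, with partial matching of the column name (it's awesome)
  - `x$a` can give the same answer as `x[["aardvark"]]`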
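
A sketch comparing the three subsetting operators. The list `lst` with an `aardvark` element is only an illustration.

```R
x <- c(10, 20, NA, 40)
x[1]             # 10 (indexing starts at 1)
x[-1]            # 20 NA 40 (negative index drops the element)
x[!is.na(x)]     # 10 20 40 (logical index)

m <- matrix(1:6, nrow = 2)
m[, 2]           # 3 4 (all rows, second column)

lst <- list(aardvark = 1:5, b = "text")
one <- lst["aardvark"]     # still a list, of length 1
el  <- lst[["aardvark"]]   # the element itself: 1 2 3 4 5
pm  <- lst$a               # partial name match, also 1 2 3 4 5
ex  <- lst[["a"]]          # NULL: [[ matches names exactly by default
```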
Vectorization and Broadcasting
- works like NumPy in python. Broadcasting is called *"recycling"*
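
A tiny sketch of vectorized arithmetic and recycling:

```R
x <- 1:4

doubled <- x * 2           # 2 4 6 8 (the scalar is recycled)
shifted <- x + c(10, 20)   # 11 22 13 24 (the shorter vector is repeated)
big     <- x > 2           # FALSE FALSE TRUE TRUE

# If the longer length is not a multiple of the shorter one,
# R still recycles but gives a warning
```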
Control Structures
- `if(condition){}else{}`
  - `y <- if(condition){value}else{value}` assigns conditionally
- `for`
  - `for(i in 1:4){}`
  - `for(i in seq_along(vector)){}`
  - `for(i in seq_len(nrow(x))){}`
  - `for(element in vector){}`
- `while(i < 10){}`
  - similar to C
- `repeat{}` - infinite loop
  - use `break` to exit
- `next`
  - like `continue` in python/C
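
A sketch that runs each control structure once. The values are arbitrary.

```R
x <- c(3, 8, 12)

# if/else is an expression, so its value can be assigned
label <- if (x[1] > 5) "big" else "small"   # "small"

# for over indices
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
}
total                                       # 23

# while
i <- 0
while (i < 10) {
  i <- i + 3
}
i                                           # 12

# repeat with break and next: collect the even numbers up to 6
evens <- c()
n <- 0
repeat {
  n <- n + 1
  if (n > 6) break          # leave the loop
  if (n %% 2 == 1) next     # skip odd numbers
  evens <- c(evens, n)
}
evens                                       # 2 4 6
```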
`with`
- `with(data, expr)` evaluates `expr` with the columns of `data` available as variables
- https://stackoverflow.com/questions/42283479/when-to-use-with-function-and-why-is-it-good
## Functions
```R
add <- function(x, y) {
  x + y  # the last evaluated value is returned
}
```
- The last value is returned, like Scheme, or use an explicit `return()`
- Functions can be passed to other functions
- Arguments can have default values like python
- Positional arguments can come after named arguments, unlike python. It is confusing as hell, e.g. `lm(data = mydata, y ~ x, model = FALSE, 1:100)`
- Argument names can also be partially matched
- Lexical scoping aka static scoping
  - like python, javascript
  - "Free variables" are searched for in the environment in which the function was defined
- Lazy evaluation
  - R won't complain about a missing argument if it is not used within the function (unlike python)
- `...` is used to pass a variable number of arguments
- Namespace
  - type `search()` to see the order in which the environments are searched
  - When R looks up a function call it skips objects that are not functions, so a variable `c` does not interfere with the function `c`
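
A sketch of the points above. All function names here (`add`, `scale_by`, `make_power`, `lazy`, `total`) are made up for illustration.

```R
# Default values, and a named argument placed before a positional one
add <- function(x, y = 1) x + y
add(2)             # 3
add(y = 10, 2)     # 12: y is matched by name, 2 then fills x

# Partial matching of argument names
scale_by <- function(x, factor = 2) x * factor
scale_by(3, fac = 10)    # 30: "fac" matches "factor"

# Lexical scoping: n is found where the inner function was defined
make_power <- function(n) function(x) x^n
cube <- make_power(3)
cube(2)            # 8

# Lazy evaluation: b is never used, so leaving it out is fine
lazy <- function(a, b) a * 2
lazy(5)            # 10

# ... passes any number of arguments along
total <- function(...) sum(...)
total(1, 2, 3)     # 6

# A variable named c does not hide the function c
c <- 5
v <- c(c, 1)       # 5 1
rm(c)
```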
Date and Time
- Dates are represented by the `Date` class (days since 1970-01-01)
  - `as.Date("2021-02-02")`
- Times are represented by the `POSIXct` or `POSIXlt` class
  - `Sys.time()`
  - `as.POSIXlt(Sys.time())$sec` - gives the current second
  - `POSIXct` - stored as a single large number under the hood (seconds since 1970-01-01)
  - `POSIXlt` - stored as a list (`sec`, `min`, `hour`, ...)
- `strptime()` - converts text to `POSIXlt`
  - `x <- strptime(c("Jan 10, 2021 10:40", "Jan 20, 2022 3:23"), "%B %d, %Y %H:%M")`
  - `?strptime` for formats
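
A sketch of what sits under the date and time classes. The time zone is fixed to UTC here, as an assumption, so the numbers do not depend on the machine.

```R
d <- as.Date("2021-02-02")
as.numeric(d)                    # 18660 days since 1970-01-01
gap <- as.Date("2021-03-01") - d # difference in days
as.numeric(gap)                  # 27

# POSIXct: one number (seconds since 1970-01-01)
t_ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(t_ct)                 # 86400

# POSIXlt: a list of components
t_lt <- as.POSIXlt("2021-01-10 10:40:00", tz = "UTC")
t_lt$hour                        # 10
t_lt$min                         # 40

# strptime parses text into POSIXlt
# (month names are matched using the current locale)
x <- strptime(c("Jan 10, 2021 10:40", "Jan 20, 2022 3:23"),
              "%B %d, %Y %H:%M", tz = "UTC")
class(x)                         # "POSIXlt" "POSIXt"
```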
Loop Functions
- `lapply(list, function, ...)`
  - Anonymous function - on the fly function like a lambda fn, e.g. `function(var) var[,1]`
  - always returns a `list` (not a vector)
  - `...` passes extra arguments on to the function
- `sapply`
  - tries to simplify the `lapply` output to a vector or matrix
  - closer equivalent to `map` in scheme
- `apply(array, margin, function, ...)`
  - `array` is a matrix or array
  - `margin` is an integer vector indicating which margins should be "retained", a bit like `axis` in pandas: 1 keeps the rows (one result per row), 2 keeps the columns (one result per column)
  - Kinda like a `reduce`
  - typically used to apply a function to the rows or columns of a matrix
  - `apply` is not faster than a for loop, it just looks concise
  - `rowSums`, `rowMeans`, `colSums`, `colMeans` - a lot faster than `apply` with `sum` or `mean`
- `mapply(function, ..., MoreArgs = NULL, SIMPLIFY = TRUE)`
  - Multivariate apply - applies the function in parallel over several vectors or lists
  - `...` holds the vectors to iterate over
  - E.g. `mapply(rep, 1:4, 2:5)`
- `tapply(vector, INDEX, function, ..., simplify = TRUE)`
  - `INDEX` is a factor of the same length as `vector`, telling which category each element of the vector belongs to; it can be generated using `gl()`
  - `function` is applied to each group of the vector
- `split`
  - splits a vector into a list based on a factor (first step of `tapply`)
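
A sketch that calls each loop function on small made-up inputs:

```R
lst <- list(a = 1:4, b = c(2, 4, 6))

l_out <- lapply(lst, mean)                 # list: a = 2.5, b = 4
s_out <- sapply(lst, mean)                 # named vector: a 2.5, b 4
firsts <- sapply(lst, function(v) v[1])    # anonymous function: a 1, b 2

m <- matrix(1:6, nrow = 2)
row_tot <- apply(m, 1, sum)    # 9 12   (one value per row)
col_tot <- apply(m, 2, sum)    # 3 7 11 (one value per column)
rowSums(m)                     # same as row_tot, but faster

reps <- mapply(rep, 1:3, 3:1)  # list: 1 1 1 / 2 2 / 3

values <- c(1, 2, 3, 4, 5, 6)
groups <- gl(2, 3)             # factor: 1 1 1 2 2 2
grp_mean <- tapply(values, groups, mean)   # group 1: 2, group 2: 5
parts <- split(values, groups)             # list with elements "1" and "2"
```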
Debug commands
- three levels of information
  - `message` - some info, execution continues
  - `warning` - something unexpected happened, execution continues
  - `error` - stops execution of the function (raised with `stop()`)
- `condition` - generic term for something that triggers such information
- `traceback()` - call it after an error happened
  - prints the function call stack
- `debug(function)` - the function then executes line by line
  - `n` for next
- `browser()` - enters debug mode once encountered in code
- `trace()` - inserts debug code without editing the function
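
A sketch of the three levels of information in one made-up function, `check_positive`. Catching the conditions with `tryCatch` is an addition here, used only so the example runs without stopping; the interactive tools (`traceback()`, `debug()`, `browser()`) are left as comments.

```R
check_positive <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")   # error: execution stops
  if (x < 0) {
    warning("x is negative")                      # warning: execution continues
    return(FALSE)
  }
  message("x is fine")                            # message: just information
  TRUE
}

ok  <- suppressMessages(check_positive(1))        # TRUE
err <- tryCatch(check_positive("a"),
                error = function(e) conditionMessage(e))
wrn <- tryCatch(check_positive(-1),
                warning = function(w) conditionMessage(w))
err   # "x must be numeric"
wrn   # "x is negative"

# Interactive use:
# check_positive("a"); traceback()   # show the call stack after the error
# debug(check_positive)              # step through on the next call, n for next
# undebug(check_positive)
```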
Random Numbers
- Prefixes of the distribution functions
  - `r` - generate random numbers
  - `d` - evaluate the probability density
  - `p` - evaluate the cumulative distribution
  - `q` - evaluate the quantile function (inverse of `p`)
- Distributions
  - `norm`
  - `pois`
  - `gamma`
  - `unif`
  - example `rnorm(100)`
- `set.seed()` - used to generate the same random numbers again
- `sample()` - samples from a vector
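
A sketch of the prefix + distribution naming and of reproducible draws. The seed value is arbitrary.

```R
# Same seed, same numbers
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)       # TRUE

# d, p, q for the standard normal
dnorm(0)              # density at 0, about 0.3989
pnorm(0)              # 0.5
qnorm(0.5)            # 0
qnorm(pnorm(1.3))     # 1.3, since q is the inverse of p

# Other distributions follow the same pattern
counts <- rpois(3, lambda = 2)
u <- runif(3)         # values between 0 and 1

# sample() draws from a vector
pick <- sample(1:10, 3)    # 3 values without replacement
perm <- sample(1:10)       # a random permutation
```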