# R code: Bee Virus Project F16 Data Analysis
# Contains code for t-test, ANOVA, and linear regression

# The hashtag before a line indicates a comment. Lines without hashtags will be read as code by the R interpreter.

# We'll use R (The R Foundation for Statistical Computing, http://www.R-project.org) to assess significant differences between our treatment levels. R is a very powerful, flexible, and free programming platform.

# Before analysis on any platform, raw data need to be organized appropriately for the specific analysis. Note that R reads numeric entries as continuous, not categorical, so if you have coded your treatments as 1, 2, 3, convert the numbers to letters (or wrap the column in factor()). We'll start with three sample datasets.

# Between analyses, you can clear R's memory of any objects by quitting and restarting. If R gets stuck in general, this is always an option. The quit command is q().

# Tell R where your data file is located by setting the working directory to point to the data folder. You can do this using the R "Misc" dropdown menu or using code; the code version follows. On a PC, select File, then Change dir.
getwd() # shows what the current working directory is
setwd("<path to your data folder>") # changes to the correct location for data file storage. For Chrissy's MacBook laptop, use setwd("/Users/chrissyspencer/Desktop/Rdata"). Note the / and straight, not curly, quotes.

# Entering data: upload or enter your data into R in a variety of ways. You can create a vector of values with c(), typing your data numbers comma-separated, or you can upload a file. Both methods are coded below.

# vector option
# trt1 <- c(type your data numbers here, comma-separated)

# file upload option. Save your data as a Windows Comma Separated .csv (on a Mac) or just a normal .csv (on a PC). Load in the data file using the read.csv() command, putting the file name in the parentheses. Assign it to an object called "data" that we can call for subsequent analysis.
data <- read.csv("tdata.csv", header = TRUE, sep = ",")
# full signature for reference: read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

# Entering an object's name, here "data", will print the object. Use this to confirm that you have input the correct data for analysis.
data

# provides descriptive stats for each column
summary(data)

# Attach the data frame so R treats the column headers as objects and knows where to find your data.
attach(data)
names(data) # lists the column names

# A t-test is a statistical technique that can be used to compare two groups or categories of data. The code below compares the means of group x and group y (by default R uses Welch's t-test, which does not assume equal variances). The null hypothesis being tested is that the means of the groups are the same; the alternative can be defined as the means are different (two.sided, the default), that x is greater than y (greater), or that x is less than y (less), set with the alternative argument.
t.test(x, y)

# Analysis of Variance (ANOVA) is a statistical technique that can be used when the explanatory variables are categorical (instead of continuous). In our model, we have treatment levels in column x and a response variable in column y. We want to know whether the average effects of the treatment levels differ from each other, and whether those differences are statistically significant.
data <- read.csv("aovdata.csv", header = TRUE, sep = ",")
data
summary(data)
detach(data) # detach the previous dataset first to avoid masking warnings
attach(data)
names(data)
a1 <- aov(y ~ x) # fit the one-way ANOVA model
summary(a1)
trt <- table(x) # counts per treatment level
barplot(trt)

# We can see which values look different, but we need a statistical method to determine which pairs are significantly different from each other.
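# Optional numeric companion to the barplot: the barplot shows counts per level,
# while the ANOVA compares group means. A minimal sketch of how to view those
# means directly, assuming the same x and y columns from aovdata.csv; tapply()
# is base R.
tapply(y, x, mean) # mean response for each treatment level
tapply(y, x, sd)   # spread within each treatment level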
# These comparisons are made after the initial planned ANOVA test, and so are called "post-hoc comparisons." Specifically, in this case we'll need pairwise comparisons. Three groups require three tests, one for each pairwise comparison, so we need to adjust our standard of statistical significance (p < 0.05) to account for making multiple comparisons.

# The first option is to run simple pairwise t-tests with the p-values adjusted for the number of comparisons (equivalent to lowering the significance threshold). This is called a Bonferroni correction.
pairwise.t.test(y, x, p.adj = "bonf")
# The Bonferroni output shows a pairwise table of adjusted p-values. Take note of which pairs are different.

# A second option is Tukey's Honestly Significant Difference test, which adjusts for all pairwise comparisons at once.
TukeyHSD(a1)
# Now the table is more complex, but the adjusted p-value for each pair is given in the rightmost column (p adj). Compare these to the Bonferroni-corrected values.

# Plot your data.
attach(data)
names(data)
plot(x, y)

# Two-way ANOVA. This assumes a dataset with two categorical treatment columns, x1 and x2; the * in the formula fits both main effects and their interaction.
a1 <- aov(y ~ x1 * x2)
summary(a1)

# Linear regression is used when the response variable (y) and the explanatory variable (x) are both continuous. The simple model is y = a + bx, where a is the intercept and b is the slope of the regression line. The line itself is a best-fit model called the "least-squares estimate," found by placing the line so that the sum of the squared vertical distances from the data points to the line is minimized. To determine fit, appropriateness, and statistical significance, we'll follow three steps: 1) fit a linear model and determine the goodness of fit using an F statistic with an accompanying p-value, 2) estimate how reliable the values of a and b are using their standard errors (computed from the residuals), obtained with the summary() command along with the coefficient estimates, and 3) determine the proportion of the variation in the data explained by the model, using the r-squared value.
data <- read.csv("regdata.csv", header = TRUE, sep = ",")
data
summary(data)
detach(data) # detach the previous dataset first to avoid masking warnings
attach(data)
names(data)
plot(x, y)
r1 <- lm(y ~ x) # fit the linear model once
abline(r1)      # add the fitted line to the plot
summary(r1)
# The p-value on the x coefficient tests whether the slope b differs significantly from zero.
plot(r1)
# This command generates 4 diagnostic plots that allow visual assessment that the data meet key assumptions of the regression model: constant variance and normally distributed residuals. Look for pure scatter in the residuals-versus-fitted plot and points following the straight reference line in the normal Q-Q plot. If instead the data show increasing scatter at larger fitted values or a non-linear pattern, linear regression isn't the best statistical approach for this particular dataset.
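# Optional follow-up, sketched assuming the fitted model r1 from above is still in
# memory. confint() and predict() are base R; the x value of 10 is a hypothetical
# example, not a value from our dataset.
confint(r1) # 95% confidence intervals for the intercept a and the slope b
predict(r1, newdata = data.frame(x = 10), interval = "confidence") # predicted y at x = 10, with a confidence interval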