Tuesday, October 14, 2014

Operate on the body of a file but not the header

Sometimes you need to run some UNIX command on a file but only want to operate on the body of the file, not the header. Create a file called body somewhere in your $PATH, make it executable, and add this to it:
#!/bin/bash
IFS= read -r header
printf '%s\n' "$header"
eval $@
Now, when you need to run something but ignore the header, use the body command first. For example, we can create a simple data set with a header row and some numbers:
$ echo -e "header\n1\n5\n4\n7\n3"
header
1
5
4
7
3
We can pipe the whole thing to sort:
$ echo -e "header\n1\n5\n4\n7\n3" | sort
1
3
4
5
7
header
Oops, we don’t want the header to be included in the sort. Let’s use the body command to operate only on the body, skipping the header:
$ echo -e "header\n1\n5\n4\n7\n3" | body sort
header
1
3
4
5
7
Sure, there are other ways to solve the problem with sort, but body will solve many more problems. If you have multiple header rows, just call body multiple times.
Inspired by this post on Stack Exchange.

Tuesday, September 16, 2014

R package to convert statistical analysis objects to tidy data frames

I talked a little bit about tidy data my recent post about dplyr, but you should really go check out Hadley’s paper on the subject.
R expects inputs to data analysis procedures to be in a tidy format, but the model output objects that you get back aren’t always tidy. The reshape2, tidyr, and dplyr are meant to take data frames, munge them around, and return a data frame. David Robinson's broom package bridges this gap by taking un-tidy output from model objects, which are not data frames, and returning them in a tidy data frame format.
(From the documentation): if you performed a linear model on the built-in mtcars dataset and view the object directly, this is what you’d see:
lmfit = lm(mpg ~ wt, mtcars)
lmfit
Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344
summary(lmfit)
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.05 on 30 degrees of freedom
Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
If you’re just trying to read it this is good enough, but if you’re doing other follow-up analysis or visualization, you end up hacking around with str() and pulling out coefficients using indices, and everything gets ugly quick.
But the tidy function in the broom package run on the fit object probably gives you what you were looking for in a tidy data frame:
tidy(lmfit)
         term estimate stderror statistic   p.value
1 (Intercept)   37.285   1.8776    19.858 8.242e-19
2          wt   -5.344   0.5591    -9.559 1.294e-10
The tidy() function also works on other types of model objects, like those produced by glm() and nls(), as well as popular built-in hypothesis testing tools like t.test(), cor.test(), or wilcox.test().
View the README on the GitHub page, or install the package and run the vignette to see more examples and conventions.

Thursday, September 11, 2014

UVA / Charlottesville R Meetup

TL;DR? We started an R Users group, awesome community, huge turnout at first meeting, lots of potential.

---

I've sat through many hours of meetings where faculty lament the fact that their trainees (and the faculty themselves!) are woefully ill-prepared for our brave new world of computing- and data-intensive science. We've started to address this by running annual Software Carpentry bootcamps (March 2013, and March 2014). To make things more sustainable, we're running our own Software Carpentry instructor training here later this month, where we'll train scientists how to teach other scientists basic computing skills like using UNIX, programming in Python or R, version control, automation, and testing. I went through this training course online a few months ago, and it was an excellent introduction to pedagogy and the psychology of learning (let's face it, most research professors were never taught how to teach; it's a skill you learn, not one you inherit with the initials behind your name).

Something that constantly comes up in these conversations is how to promote continued learning and practice after the short bootcamp is over. Students get a whirlwind tour of scientific computing skills over two days, but there's generally very little follow-up that's necessary to encourage continued practice and learning.

At the same time we've got a wide variety of scientists spanning all disciplines including social sciences, humanities, medicine, physics, chemistry, math, and engineering that are doing scientific computing and data analysis on a daily basis who could really benefit from learning from one another.

These things motivated us to start a local R Users Group. So far we have 118 people registered on Meetup.com, and this week we had an excellent turnout at our first meeting, with over 70 people who RSVP'd.


At this first meetup we kicked things off with an introduction to the group, why we started it, and our goals. I then launched into a quick demo of some of the finer features in the dplyr package for effortless advanced data manipulation and analysis. The meetup group has a GitHub repository where all the code from our meetups will be stored. Finally, we concluded with a discussion of topics the group would like to see presented in the future: ggplot2, R package creation, reproducible research, dynamic documentation, and web scraping were a few of the things mentioned. We collectively decided that either talks could be either research talks highlighting how these things were used in an actual research project, or they could be demo/tutorial in nature, like the dplyr talk I gave.

The rich discussion we had at the end of this session really highlighted the potential this community has. In addition to the diversity and relative gender-balance we saw in our first meetup's participants, we had participants from all over UVA as well as representation from local industry and non-profit organizations.

I, for one, am tremendously excited about this new community, and will post the occasional update from the group here on this blog.


Wednesday, September 10, 2014

GNU datamash

GNU datamash is a command-line utility that offers simple calculations (e.g. count, sum, min, max, mean, stdev, string coalescing) as well as a rich set of statistical functions, to quickly assess information in textual input files or from a UNIX pipe. Here are a few examples:
E.g., let’s use the seq command to generate some data, and use datamash sum to add up all the values in the first column:
$ seq 5
1
2
3
4
5
$ seq 5 | datamash sum 1
15
What else? Let’s calculate the mean, 1st quartile, median, 3rd quarile, IQR, sample-standard-deviation, and p-value of Jarque-Bera test for normal distribution, using some data in file.txt:
$ cat file.txt | datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
Go take a look at more examples, specifically the examples using datamash to parse a GTF file. Datamash can very quickly do things like find the number of isoforms per gene by grouping (collapsing) over gene IDs and counting the number of transcripts, find the number of genes with more than 5 isoforms, find genes transcribed from multiple chromosomes, examine variability in exon counts per gene, etc.

Wednesday, August 20, 2014

Do your "data janitor work" like a boss with dplyr

Data “janitor-work”

The New York Times recently ran a piece on wrangling and cleaning data:
Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scrubbing, tidying, or something else, the article above is worth a read (even though it implicitly denigrates the important work that your housekeeping staff does). It’s one of the few “Big Data” pieces that truly appreciates what we spend so much time doing every day as data scientists/janitors/sherpas/cowboys/etc. The article was chock-full of quotable tidbits on the nature of data munging as part of the analytical workflow:
It’s an absolute myth that you can send an algorithm over raw data and have insights pop up…
Data scientists … spend 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
But if the value comes from combining different data sets, so does the headache… Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place.
As data analysis experts we justify our existence by (accurately) evangelizing that the bottleneck in the discovery process is usually not data generation, it’s data analysis. I clarify that point further with my collaborators: data analysis is usually the easy part — if I give you properly formatted, tidy, and rigorously quality-controlled data, hitting the analysis “button” is usually much easier than the work that went into cleaning, QC’ing, and preparing the data in the first place.
To that effect, I’d like to introduce you to a tool that recently made its way into my data analysis toolbox.

dplyr

Unless you’ve had your head buried in the sand of the data analysis desert for the last few years, you’ve definitely encountered a number of tools in the “Hadleyverse.” These are R packages created by Hadley Wickham and friends that make things like data visualization (ggplot2), data management and split-apply-combine analysis (plyr, reshape2), and R package creation and documentation (devtools, roxygen2) much easier.
The dplyr package introduces a few simple functions, and integrates functionality from the magrittr package that will fundamentally change the way you write R code.
I’m not going to give you a full tutorial on dplyr because the vignette should take you 15 minutes to go through, and will do a much better job of introducing this “grammar of data manipulation” than I could do here. But I’ll try to hit a few higlights.

dplyr verbs

First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation:
  • filter() filters rows from the data frame by some criterion
  • arrange() arranges rows ascending or descending based on the value(s) of one or more columns
  • select() allows you to select one or more columns
  • mutate() allows you to add new columns to a data frame that are transformations of other columns
  • group_by() and summarize() are usually used together, and allow you to compute values grouped by some other variable, e.g., the mean, SD, and count of all the values of $y separately for each level of factor variable $group.
Individually, none of these add much to your R arsenal that wasn’t already baked into the language in some form or another. But the real power comes from chaining these commands together, with the %>% operator.

The %>% operator: “then”

This dataset and example was taken directly from, and is described more verbosely in the vignette. The hflights dataset has information about more than 200,000 flights that departed Houston in 2011. Let’s say we want to do the following:
  1. Use the hflights dataset
  2. group_by the Year, Month, and Day
  3. select out only the Day, the arrival delay, and the departure delay variables
  4. Use summarize to calculate the mean of the arrival and departure delays
  5. filter the resulting dataset where the arrival delay or the departure delay is more than 30 minutes.
Here’s an example of how you might have done this previously:
filter(
  summarise(
    select(
      group_by(hflights, Year, Month, DayofMonth),
      Year:DayofMonth, ArrDelay, DepDelay
    ),
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
Notice that the order that we write the code in this example is inside out - we describe our problem as: use hflights, then group_by, then select, then summarize, then filter, but traditionally in R we write the code inside-out by nesting functions 4-deep:
filter(summarize(select(group_by(hflights, ...), ...), ...), ...)
To fix this, dplyr provides the %>% operator (pronounced “then”). x %>% f(y) turns into f(x, y) so you can use it to rewrite multiple operations so you can read from left-to-right, top-to-bottom:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Writing the code this way actually follows the order we think about the problem (use hflights, then group_by, then select, then summarize, then filter it).
You aren’t limited to using %>% to only dplyr functions. You can use it with anything. E.g., instead of head(iris, 10), we could write iris %>% head(10) to get the first ten lines of the built-in iris dataset. Furthermore, since the input to ggplot is always a data.frame, we can munge around a dataset then pipe the whole thing into a plotting function. Here’s a simple example where we take the iris dataset, then group it by Species, then summarize it by calculating the mean of the Sepal.Length, then use ggplot2 to make a simple bar plot.
library(dplyr)
library(ggplot2)
iris %>%
  group_by(Species) %>%
  summarize(meanSepLength=mean(Sepal.Length)) %>%
  ggplot(aes(Species, meanSepLength)) + geom_bar(stat="identity")
Once you start using %>% you’ll wonder to yourself why this isn’t a core part of the R language itself rather than add-on functionality provided by a package. It will fundamentally change the way you write R code, making it feel more natural and making your code more readable. There's a lot more dplyr can do with databases that I didn't even mention, and if you're interested, you should see the other vignettes on the CRAN package page.
As a side note, I’ve linked to it several times here, but you should really check out Hadley’s Tidy Data paper and the tidyr package, vignette, and blog post.

Monday, July 7, 2014

Introduction to R for Life Scientists: Course Materials

Last week I taught a three-hour introduction to R workshop for life scientists at UVA's Health Sciences Library.



I broke the workshop into three sections:

In the first half hour or so I presented slides giving an overview of R and why R is so awesome. During this session I emphasized reproducible research and gave a demonstration of using knitr + rmarkdown in RStudio to produce a PDF that can easily be recompiled when data updates.

In the second (longest) section, participants had their laptops out with RStudio open coding along with me as I gave an introduction to R data types, functions, getting help, data frames, subsetting, and plotting. Participants were challenged with an exercise requiring them to create a scatter plot using a subset of the built-in mtcars dataset.

We concluded with an analysis of RNA-seq data using the DESeq2 package. We started with a count matrix and a metadata file (the modENCODE pasilla knockout data packaged with DESeq2), imported the data into a DESeqDataSet object, ran the DESeq pipeline, extracted results, and did some basic visualization (MA-plots, PCA, volcano plots, etc). A future day-long course will cover RNA-seq in more detail (intro UNIX, alignment, & quantitation in the morning; intro R, QC, and differential expression analysis in the afternoon).

I wrote the course materials using knitr, rendered using Jekyll, hosted as a GitHub project page. The rendered course materials can be found at the link below, and the source is on GitHub.

Course Materials: Introduction to R for Life Scientists

Slides:



Cheat Sheet:





Tuesday, June 24, 2014

Bedtools tutorial from 2013 CSHL course

A couple of months ago I posted about how to visualize exome coverage with bedtools and R. But if you're looking to get a basic handle on genome arithmetic, take a look at Aaron Quinlan's bedtools tutorials from the 2013 CSHL course. The tutorial uses data from the Maurano et al exploration of DnaseI hypersensitivity sites in hundreds of primary tissue types (Science 337:1190-1195).

The tutorial provides examples with pictures and code to do things like:

Intersections to find features that overlap (or do NOT overlap):



Merging features like {ChIP,MEDIP,DNAse,*}-seq peaks:



Examining coverage:



Advanced usage using the Jaccard statistic to measure similarity of all 20 tissue types to all other 20 20 tissues, and visualizing this with principal components analysis and ggplot2 in R:



See the full bedtools documentation for more.

2013 CSHL bedtools tutorial: http://quinlanlab.org/tutorials/cshl2013/bedtools.html

Friday, June 13, 2014

An Annotated Online Bioinformatics / Computational Biology Curriculum

Two years ago David Searls published an article in PLoS Comp Bio describing a series of online courses in bioinformatics. Yesterday, the same author published an updated version, "A New Online Computational Biology Curriculum," (PLoS Comput Biol 10(6): e1003662. doi: 10.1371/journal.pcbi.1003662).

This updated curriculum has a supplemental PDF describing hundreds of video courses that are foundational to a good understanding of computational biology and bioinformatics. The table of contents embedded into the PDF's metadata (Adobe Reader: View>Navigation Panels>Bookmarks; Apple Preview: View>Table of Contents) breaks the curriculum down into 11 "departments" with links to online courses in each subject area:

  1. Mathematics Department
  2. Computer Science Department
  3. Data Science Department
  4. Chemistry Department
  5. Biology Department
  6. Computational Biology Department
  7. Evolutionary Biology Department
  8. Systems Biology Department
  9. Neurosciences Department
  10. Translational Sciences Department
  11. Humanities Department
Listings in the catalog can take one of three forms: Courses, Current Topics, or Seminars. All listed courses are video-based and free of charge, many being MOOCs offered by Coursera or edX.

More than just a link dump, the author of this paper has "road tested" most of the courses, having enrolled in up to a dozen at a time. Under each course listing the author offers commentary on the importance of the subject to a computational biology education and an opinion on the quality of instruction. (The author ranks in the top 50 students on Coursera in terms of the number of completed courses on Coursera!). For the courses that the author completed, listings have an "evaluation" section, which ranks the course in difficulty, time requirements, lecture/homework effectiveness, assessment quality, and overall opinions. The author also injects autobiographical annotations as to why he thinks the courses are useful in a bioinformatics career. 

In summary, years of work went into creating this annotated course catalog, and is a great resource if you're looking to get started in a computational biology career, or if you're just looking to brush up on individual topics ranging from natural language processing to evolutionary theory.


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.