Wednesday, February 1, 2017

Staying Current in Bioinformatics & Genomics: 2017 Edition

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.

People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.

What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.

Twitter

In the 2012 post I ended with Twitter, but I have to lead with it this time. Twitter is probably my most valuable resource for learning about the bleeding-edge developments in genomics & bioinformatics. It’s great for learning what’s new and contributing to the dialogue in your field, but only when used effectively.

I aggressively prune the list of people I follow to keep what I see relevant and engaging. I can tolerate an occasional digression into politics, posting pictures of you drinking with colleagues at a conference, or self-congratulatory announcements. But once these off-topic Tweets become the norm, I unfollow. I also rely on the built-in list feature. I follow a few hundred people, but I only add a select few dozen to a “notjunk” list that I look at when I’m short on time. Folks in this list don’t Tweet too often and have a high signal-to-noise ratio (as far as what I’m interested in reading). If I don’t get a chance to catch up on my entire timeline, I can at least breeze through recent Tweets from folks on this list.

I’m also wary of following extremely prolific users. For example — if someone’s been on Twitter less than a year, already has 20,000 Tweets, but only 100 followers, it tells me they’ve got a lot to say but nobody cares. I let the hive mind work for me in this case, using this Tweet-to-follower ratio as sort of a proxy for signal-to-noise.
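
As a rough illustration of that heuristic, here is a minimal Python sketch. The function name and the second account are made up for illustration; only the 20,000 Tweets / 100 followers figures come from the example above, and any cutoff for "too noisy" would be a personal judgment call.

```python
def tweets_per_follower(tweet_count: int, follower_count: int) -> float:
    """Tweets posted per follower; higher values suggest more noise."""
    # Treat zero followers as one to avoid dividing by zero.
    return tweet_count / max(follower_count, 1)


# The example from the text: 20,000 Tweets but only 100 followers.
prolific = tweets_per_follower(20_000, 100)
# A hypothetical account for comparison: 2,000 Tweets and 5,000 followers.
selective = tweets_per_follower(2_000, 5_000)

print(prolific)   # 200.0
print(selective)  # 0.4
```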

I mostly follow individuals and aggregators, but I also follow a few organization accounts. These can be a mixed bag. Only a few organization accounts do this well, delivering interesting and applicable content to a targeted audience; many more are poor attempts at marketing and self-promotion that offer no substantive value or interesting content.

Individuals: In no particular order, here’s an incomplete list of people who Tweet content that I find consistently on-topic and interesting.

Aaron Quinlan (aaronquinlan)
Adam Phillippy (aphillippy)
Andrew Severin (isugif)
Casey Greene (GreeneScientist)
Clive Brown (Clive_G_Brown)
Dan MacArthur (dgmacarthur)
David Robinson (drob)
Elisabeth Bik (MicrobiomDigest)
Frank Harrell (f2harrell)
Hadley Wickham (hadleywickham)
Heng Li (lh3lh3)
James Hadfield (coregenomics)
Jared Simpson (jaredtsimpson)
Jeff Leek (jtleek)
Jenny Bryan (JennyBryan)
Julia Silge (juliasilge)
Krista Ternus (KristaTernus)
Lex Nederbragt (lexnederbragt)
Lior Pachter (lpachter)
Mick Watson (biomickwatson)
Mike Love (mikelove)
Nick Loman (pathogenomenick)
Nicolas Robine (notSoJunkDNA)
Phil Ashton (flashton2003)
RNA-seq Blog (rnaseqblog)
Rob Patro (nomad421)
Roger Peng (rdpeng)
Sam Minot (sminot)
Sean Davis (seandavis12)
Titus Brown (ctitusbrown)
Torsten Seemann (torstenseemann)
Tuuli Lappalainen (tuuliel)
Vince Buffalo (vsbuffalo)
Willem van Schaik (WvSchaik)
Zamin Iqbal (ZaminIqbal)
Many more I’m failing to specifically mention…

Others: Besides individual accounts, there are also a number of aggregators and organizations that I keep on a high signal-to-noise list.

bioRxiv (biorxivpreprint)
bioRxiv Bioinfo (biorxiv_bioinfo)
bioRxiv Genomics (biorxiv_genomic)
Metagenomics Papers (metagenomic_lit)
InformaticsGW (UduakGW)
Hacker News 300 (newsyc300)
CompBiolPapers (compbiolpapers)
RNA-seq paper aggregator (RNA_seq)
Bioconductor (Bioconductor)
RStudio Tips (rstudiotips)

Blogs

I follow these and other blogs using RSS. I’ve been happy with the free version of Feedly ever since Google Reader was killed. The web interface and iOS app have everything I need, and they both integrate nicely with other services like Evernote, Instapaper, Buffer, Twitter, etc. If you can’t find a direct link to the blog’s RSS feed, you can usually type the name of the blog into Feedly’s search bar and it’ll find it for you. Similar to my “notjunk” list in Twitter, I have a Favorites category in Feedly where I include only the feeds I absolutely wouldn’t want to miss.

These are some of the few that I try to read whenever something new is posted, and Feedly helps me keep those organized, either by “starring” something I want to come back to, or saving it for later with Instapaper. They’re in no particular order, and I’m sure I’ve forgotten something.

Variance Explained: David Robinson’s blog (Data Scientist at Stack Overflow, works in R and Python).
Global Biodefense: News on pathogens, outbreaks, and preparedness, with periodic posts on genomics and bioinformatics-related developments and funding opportunities.
In between lines of code: Lex Nederbragt’s blog on biology, sequencing, bioinformatics, …
Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.
Bits of DNA: Reviews and commentary on computational biology by Lior Pachter (fair warning: dialogue here can get a bit heated!).
Blue Collar Bioinformatics: articles related to tool validation and the open source bioinformatics community.
Microbiome Digest - Bik’s Picks: A daily digest of scientific microbiome papers, by Elisabeth Bik, Science Editor at uBiome.
Living in an Ivory Basement: Titus Brown’s blog on metagenomics, open science, testing, reproducibility, and programming.
Enseqlopedia: James Hadfield’s blog on all things NGS.
Epistasis Blog: Jason Moore’s computational biology blog.
RStudio Blog: announcements about new RStudio functionality, updates about the tidyverse, and more.
nextgenseek.com: Next-Gen Sequencing Blog covering new developments in NGS data & analysis.
RNA-Seq Blog: Transcriptome Research & Industry News.
The Allium: We all need a little humor in our lives. Like The Onion, but for science.

Others

I’m unsure how to categorize the rest. These are things like aggregators, Q&A sites/forums, and others.

Nuzzel is something I’ve only been using for a few months but it works very well. It’s meant to solve the Twitter / social media overload problem. If you’re following a few hundred people, you could easily have thousands of Tweets per day to read through (or miss). Nuzzel emails you a daily newsletter of the most relevant content in your Twitter feed. I’m guessing it does this by analyzing how many people you follow share, retweet, or favorite the same links. I try to read everything in my RSS feeds but I could never do this with Twitter (nor should you worry about trying). Nuzzel helps you catch up on things that are trending among the people you follow. It’s not a substitute for following the right people (see the Twitter section above).
RWeekly: weekly updates from the entire R community. Offers an RSS feed but I subscribe to the weekly email. Each email contains about 50 links, with one-sentence descriptions, to things being done in the R community that week.
R Bloggers aggregates RSS feeds from hundreds of blogs about R. Much more comprehensive than RWeekly, but lots to sort through.
GenomeWeb still provides high-quality original content as well as summaries of what’s going on in the field. Create an account, log in, view your profile page, and subscribe to some of their regular emails. I subscribe to their daily news, the scan, informatics, sequencing, and infectious diseases bulletins. Pro tip: Much of their content is only available for premium subscribers. If you sign up with a .edu address, you can access all this content for free.
F1000’s Smart Search is one of the few literature recommendation services that I find useful, relevant, and current. My RNA-seq and metagenomics alerts consistently deliver relevant and fresh content.
BioStars: This is a Stack Exchange-style Q&A site focused on bioinformatics, computational genomics, and biological data analysis. You can go to the homepage and sort by topic, views, answers, etc., and the platform offers several granular ways to subscribe via RSS.
Bioconductor Support: This is a Q&A site much like BioStars that replaced the Bioconductor mailing list. You can do things like limit to a certain time period and sort by views, for example, if you only want to log in occasionally to see what’s being talked about.
SEQanswers: I subscribe to all new threads in the SEQanswers bioinformatics forum, and regularly browse post titles. When something sparks my interest, I’ll click into that post and subscribe to future updates on that post via email.
Google Scholar lets you search and create email alerts.
PubMed Alerts: You can save, automate, and have search results emailed to you through your MyNCBI account. Surprisingly, these seem to be more relevant than the Google Scholar searches for the terms that I use.
PubMed Trending - I have no idea how PubMed ranks these. It seemed to be more useful in the past, but now it seems that the top “trending” articles alternate between CRISPR/Cas9, and old kinesiology / sports medicine articles.
IFTTT: If This Then That is a service that connects many different web services together in an endless number of ways. At home I might connect Facebook and Dropbox, so that whenever someone tags me in a photo, that photo is automatically downloaded to my Dropbox. At work I can connect an RSS feed to an Evernote note or Google Doc. It’s useful in so many ways, both for personal and for work-related tasks. I mostly use it here as a last safeguard so that things I really shouldn’t miss don’t slip through the cracks. I have recipes that do things like email me if certain low-volume Twitter accounts post a new Tweet, and others that automatically save things like starred articles in Feedly to Instapaper. I also use this to keep a close eye on a few accounts on GitHub. I have connections set up for a few users on GitHub so that whenever one of these users creates a new public repository, I get an email. I’ve also used IFTTT to archive Tweets coming out of various hashtags: you can create a recipe where, if a new Tweet contains certain keywords or hashtags, that Tweet is saved to Evernote, a shared Google Doc spreadsheet, etc. Zapier is a similar service that I’ve heard provides more granular control, but I haven’t tried it.
Podcasts: I listen to every episode of Roger Peng and Hilary Parker’s Not So Standard Deviations data science podcast, and most episodes of Roger Peng and Elizabeth Matsui’s The Effort Report (this one’s more about life in academia in general). I use the Overcast iOS app to listen to these and other podcasts on ~1.75X speed. (When I met Hilary at the RStudio Conference I heard her speak for the first time at regular 1X speed. Odd experience.) Finally, I just learned about the R podcast. I haven’t listened to much yet, but I’ve added it to my long Overcast queue.

Preprints!

Preprints in life sciences were nearly unheard of when I wrote the 2012 post. Now everybody’s doing it. There are still a few people using the arXiv Quantitative biology channel, and I’ll occasionally find something in PeerJ Preprints that grabs my attention.

bioRxiv is the biggest player here, hands down. The Alerts/RSS page lets you sign up for email alerts on particular topics, or subscribe to RSS feeds coming from particular categories that interest you. I subscribe to the Genomics and Bioinformatics feeds. I also follow several of bioRxiv’s top-level and category Twitter feeds (@biorxivpreprint, @biorxiv_bioinfo, and @biorxiv_genomic).

F1000 Research deserves some special attention here. It’s somewhere in-between a preprint server and a peer-reviewed publication. You can upload manuscripts (or other research outputs like posters or slides), and they’re immediately and permanently published, and given a DOI. Then one or more rounds of open peer review as well as public comment take place, and authors can update the published paper for further review. Check out the transcript estimates / gene inference paper I mentioned earlier. You’ll see it’s “version 2,” and was approved by two referees. If you look at the right-hand panel, you can actually go back and see the version prior to revision, as well as see who reviewed it, what the reviewer wrote, and how the authors responded to those reviews. It’s an innovative platform where peer review is open and transparent, and is independent of publication, since papers are published before they are reviewed, and remain regardless of the outcome of the review. F1000 Research has a number of channels that are externally curated by different organizations, societies, conferences, etc. I subscribe to and get alerts about the R package and Bioconductor channels. Whenever a new preprint is dropped into one of these channels, I’ll get an email and an RSS item.

I only recently discovered PrePubMed, which looks very useful. PrePubMed indexes preprints from arXiv q-bio, PeerJ Preprints, bioRxiv, F1000Research, preprints.org, The Winnower, Nature Precedings, and Wellcome Open Research. In the tools box on the homepage, you can enter a search string and get back an RSS feed with results from that search. It looks like PrePubMed is maintained by a single person, but he’s made the entire thing open source, so you could presumably set this up and mirror it on your own, should you check back in 2021 and the link be dead.

Journals

I started with Journals in my 2012 post, but they’re last (and probably least) here. I still subscribe to a few journals’ RSS feeds, but in most cases, by the time I see a new Table of Contents hit my RSS reader, I probably saw the publications making the rounds on Twitter, blogs, or other channels mentioned above. It’s also no longer unusual to see a “publication” land where I read the preprint on bioRxiv months ago, and perhaps even a blog post before that! What “publication” means is changing rapidly, and I’m sure the lines between a blog post, preprint, and journal article will be even blurrier in the year 2022 post.

How do you have the time to do this?

How do you not? It’s not as bad as it seems. I probably spend an hour each weekday scanning all the resources mentioned here, and I find the time well spent. I can breeze through my Twitter and RSS feeds on my bus ride into work, saving things I actually want to look at later with a bookmark, star, favorite, Instapaper, etc.

I should have prefaced this whole article with the note that I hardly ever actually fully read any of the papers or blog posts I see here. If I see, for example, a new WGS variant caller published, I’ll glance at the figures benchmarking it against GATK and FreeBayes, and skim through the documentation on the GitHub README or Bioconductor vignette. If either of these is missing or falls short, that’s usually enough for me to ignore the publication completely (don’t underestimate the importance of good documentation!).

It’s taken me a decade to compile and continually hone this list of resources to the things that I find useful and relevant. This is what works for me, now, in 2017. It’s not a one-size-fits-all, and the 2018-me will probably have a somewhat different list, but I hope you’ll find it useful. If your interests are similar to what I’ve discussed here, how do you stay current? What have I left out? Let me know in the comments!

Saturday, January 14, 2017

RStudio Conference 2017 Recap

The first ever RStudio conference was held January 11-14, 2017 in Orlando, FL. For anyone else like me who spends hours each working day staring into an RStudio session, the conference was truly excellent. The speaker lineup was diverse and covered lots of areas related to development in R, including the tidyverse, the RStudio IDE, Shiny, htmlwidgets, and authoring with RMarkdown.

This is not a complete list by any means; with split sessions I could only go to half the talks at most. Here are some noncomprehensive notes and links to slides and resources for some of the awesome things people are doing with R and RStudio that I learned about at the RStudio Conference.

Hadley Wickham kicked off the meeting with a keynote on doing data science in R. The talk focused on the tidyverse, and the notion of splitting functions into commands that do something, as compared to queries that calculate something, and how it’s generally a good idea to keep these different functionalities contained in their own separate functions. (Contrast this with things like lm, which both computes values and does things, like printing those values to the screen, making the results difficult to capture; see broom.)
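
To make the command/query split concrete, here is a minimal sketch written in Python rather than R, so it is not the tidyverse or broom API. The function names and the toy data are hypothetical: one function only calculates and returns a value (a query), and the other only produces a side effect (a command).

```python
from statistics import mean


def fit_line(x, y):
    """Query: compute and return the least-squares slope and intercept."""
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    variance = sum((xi - x_bar) ** 2 for xi in x)
    slope = covariance / variance
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def report_fit(fit):
    """Command: print a fit to the screen; returns nothing worth capturing."""
    print(f"slope={fit['slope']:.2f} intercept={fit['intercept']:.2f}")


# Toy data lying exactly on y = 2x + 1.
fit = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
report_fit(fit)
```

Because fit_line never prints, its result is easy to capture and reuse, and the printing can be swapped out without touching the calculation.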

I asked Hadley after his talk about strategies to reduce issues getting Bioconductor data structures to play nicely with tidyverse tools. Within minutes David Robinson released a new feature in the fuzzyjoin package that leverages IRanges within this tidyverse-friendly package for efficiently doing things like joining on genomic intervals.

Another #rstudioconf-inspired addition to fuzzyjoin:

genome_join, for overlapping intervals on the same chromosome @genetics_blog #rstats pic.twitter.com/oUctyNYc09
— David Robinson (@drob) January 13, 2017
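
For readers unfamiliar with the idea, joining on genomic intervals means pairing rows whose intervals sit on the same chromosome and overlap. The Python sketch below only illustrates that concept and is not how fuzzyjoin or IRanges implement it. The function name, the inclusive-coordinate convention, and the intervals are all assumptions made up for the example, and the brute-force loop would be far too slow for real genomic data.

```python
def genome_overlap_join(left, right):
    """Pair up intervals that share a chromosome and overlap by at least one base.

    Each interval is a (chromosome, start, end) tuple with inclusive coordinates.
    """
    pairs = []
    for l_chrom, l_start, l_end in left:
        for r_chrom, r_start, r_end in right:
            same_chromosome = l_chrom == r_chrom
            overlaps = l_start <= r_end and r_start <= l_end
            if same_chromosome and overlaps:
                pairs.append(((l_chrom, l_start, l_end), (r_chrom, r_start, r_end)))
    return pairs


# Hypothetical intervals, made up for illustration.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
genes = [("chr1", 150, 300), ("chr2", 250, 400), ("chr2", 180, 220)]

joined = genome_overlap_join(peaks, genes)
for peak, gene in joined:
    print(peak, "overlaps", gene)
```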

Charlotte Wickham’s 2-hour purrr tutorial was awesome. Here’s a link to a shared dropbox folder with code, challenges, slides, data, etc. The purrr package is a core package in the tidyverse, and I’ll be replacing many of the base ?apply and plyr ??ply functions that I still use here and there. The map_* functions are integral to working with nested list-columns in dplyr, and I think I’m finally starting to grok how to work with these.

Jenny Bryan gave a great talk on list columns. You can see her slides here. Jenny also put together this excellent tutorial with lots of worked examples and code snippets. And if you need some example list data structures for more practice or for teaching that aren’t foo/bar/iris/mtcars-level boring, see her repurrrsive package. Related to this, for more on list columns and purrr map functions, start reading at the “Many Models” section of Hadley’s R for Data Science book.

Julia Silge, data scientist at Stack Overflow, gave a great introduction to tidy text mining with R. You can read Julia and David’s Tidy Text Mining with R book here online (the book was authored in Rmarkdown using bookdown!).

Andrew Flowers, data journalist and former writer at FiveThirtyEight, gave the second day’s keynote address on finding and telling stories using R. He gave a series of examples illustrating six motivating features that make data stories worth telling, along with the potential danger inherent to each one:

Novelty (potential danger: triviality)
Outlier (spurious result; see also, p-hacking)
Archetype (oversimplification)
Trend (variance)
Debunking (confirmation bias)
Forecast (overfitting)

Yihui Xie led a two-hour tutorial on advanced RMarkdown. You can see his slides here. The rticles package has LaTeX Journal Article Templates for R Markdown for various journals. The tufte package now supports both PDF and HTML output. See an example here. Yihui’s xaringan package ports the remark.js library for slideshows into R. Careful. Yihui warns that you may not sleep after learning about how cool remark.js is. Yihui showed an early version of the in-development blogdown package that can build blog-aware static websites using the blazing-fast and well-documented Hugo static site generator. Finally, the bookdown package is just awesome. It takes multiple RMarkdown documents as input and renders into multiple output formats (screen-readable ebook, PDF, epub, etc.). It looks great for writing books and technical documentation with pushbutton publishing to multiple output formats with some nice built-in styles out of the box. Some examples:

bookdown.org/yihui/bookdown — The bookdown book, written in RMarkdown with bookdown. (whoa, meta)
r4ds.had.co.nz — Garrett Grolemund and Hadley Wickham’s R for Data Science book.
tidytextmining.com — Julia and David’s book on text mining
moderndive.com — an open-source introductory statistics class textbook

Finally, a few gems from other talks that I jotted down:

Chester Ismay gave a great talk on teaching introductory statistics using R, with the open-source course textbook written in RMarkdown using bookdown.
Bob Rudis talked about using pipes (%>%), and pipes within pipes, and best piping practices. See his slides here.
Hilary Parker talked about the idea of analysis development (and analysis developers), drawing similarities to software development/developers. Hilary discussed this once before on the excellent podcast that she and Roger Peng host, and you can probably find it in their Conversations On Data Science ebook, which summarizes and transcribes these conversations.
Simon Jackson introduced the corrr package for exploring and manipulating correlations and correlation matrices in a tidy way.
Gordon Shotwell introduced the easymake package that generates Makefiles from a data frame using R.
Karthik Ram quickly introduced several of the (many) rOpenSci packages related to data publication, data access, scientific literature access, scalable & reproducible computing, databases, visualization, taxonomy, geospatial analysis, and many utility tools for data analysis and manipulation.

With split sessions I missed more than half the talks. Lots of people here are active on Twitter, and you can catch many more notes and tidbits on the #rstudioconf hashtag. The meeting was superbly organized, I learned a ton, and I enjoyed meeting in person many of the folks I follow on Twitter and elsewhere online. A few days of 80-degree weather in mid-January didn’t hurt either. I’ll definitely be coming again next year. Kudos to the rstudio::conf organizers and speakers!

All the talks were recorded and will supposedly find their way to rstudio.com at some point soon. I’ll update this post with a link when that happens.

Update Feb 16, 2017: All the talks have now been posted online here under the rstudio::conf2017 heading.

Monday, September 19, 2016

Primers in computational biology

I recently stumbled across this collection of computational biology primers in Nature Biotechnology. Many of these are old, but they're still great resources to get a fundamental understanding of the topic. Here they are in no particular order.

...

How does multiple testing correction work?
http://www.nature.com/nbt/journal/v27/n12/full/nbt1209-1135.html

What is principal component analysis?
http://www.nature.com/nbt/journal/v26/n3/full/nbt0308-303.html

SNP imputation in association studies
http://www.nature.com/nbt/journal/v27/n4/full/nbt0409-349.html

How does gene expression clustering work?
http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html

What is a hidden Markov model?
http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html

What is a support vector machine?
http://www.nature.com/nbt/journal/v24/n12/full/nbt1206-1565.html

What is the expectation maximization algorithm?
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html

Thursday, June 30, 2016

Syntax Highlight Code in Keynote or Powerpoint

I came across this awesome gist explaining how to syntax highlight code in Keynote. The same trick works for Powerpoint. Mac only.

Install homebrew if you don’t have it already, then run brew install highlight.
Run highlight -O rtf myfile.ext | pbcopy to convert the code to syntax-highlighted RTF and copy the result to the system clipboard.
Paste into Keynote or Powerpoint.

If I’ve got some code in a file called eset_pca.R:

I can simply highlight -O rtf eset_pca.R | pbcopy and then paste it right into Keynote or Powerpoint.
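
If you do this often, the pipeline is easy to wrap in a script. Here is a minimal Python sketch; the helper name is hypothetical, and it only builds the same highlight -O rtf ... | pbcopy command shown above, quoting the file name so that paths with spaces survive the shell. Running it still requires a Mac with highlight installed.

```python
import shlex


def highlight_to_clipboard_command(path: str) -> str:
    """Build the shell pipeline that copies syntax-highlighted RTF to the clipboard."""
    # shlex.quote protects file names containing spaces or shell metacharacters.
    return f"highlight -O rtf {shlex.quote(path)} | pbcopy"


command = highlight_to_clipboard_command("eset_pca.R")
print(command)

# To actually run it (macOS only, with highlight installed):
#   import subprocess
#   subprocess.run(command, shell=True, check=True)
```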