CAFA PI Data Exploration

The Challenge

This semester, I’ve been working with the BYU Bioinformatics Research Group on the CAFA PI challenge. The goal of CAFA PI is to predict which proteins perform some specific functions. The first major roadblock: no training data was provided.

So we divided into three groups:

  • Data acquisition (we needed to get our hands on some training data),
  • Data transformation (said training data needed to be formatted), and
  • Data analytics.

The goal of this post is to explore the test data in order to inform our future acquisition / tranformation / analytics decisions.

[Insert cool meme here]

The Data

Targets

CAFA PI gives us protein sequences for two organisms:

  • Pseudomonas aureginosa (this is the spelling the CAFA website gives, but it seems like the correct spelling is aeruginosa), a multidrug resistant bacteria that’s associated with, among other nasty things, hospital-acquired infections.
  • Candida albicans, an opportunistic yeast that’s found in about half of all healthy adults but that will also not hesitate to turn on you as soon as your immune system goes down. Wikipedia says that somewhere between 2,800 and 11,200 people in the U.S. die from Candida infections every year.

The training data comes in FASTA format, so our first order of business is to turn the FASTAs into a tidy, rectangular format. Code to download and parse the data into CSVs can be found on our GitHub repo.

Let’s load the data in and take a quick look:

pseudomonas <- read_csv("data/cafa-pi/target.208963.csv")
pseudomonas$Organism <- "Pseudomonas aureginosa"
candida <- read_csv("data/cafa-pi/target.237561.csv")
candida$Organism <- "Candida albicans"
targets <- bind_rows(pseudomonas, candida)
targets %>% select(-Sequence) %>% head() %>% kable()
CAFA_ID Taxon_ID NCBI_Locus_Tag Organism
T2089630000001 208963 PA14_31220 Pseudomonas aureginosa
T2089630000002 208963 PA14_64790 Pseudomonas aureginosa
T2089630000003 208963 PA14_35250 Pseudomonas aureginosa
T2089630000004 208963 PA14_48450 Pseudomonas aureginosa
T2089630000005 208963 PA14_29300 Pseudomonas aureginosa
T2089630000006 208963 PA14_27470 Pseudomonas aureginosa
targets$Sequence %>% head()
## [1] "MSSHHDYIIEITAQHDALKPFAPENGQPLRFKIGDAVIYTNEYGVQFRRCVTGFYRPSGLCGHYARGARYLLNSTSPWVPVAESSLRPDDSA"                                                                                                                                                                                                                                                                                                                                                                
## [2] "MNRDLRRALDDEPMRPFQWLAVAVCIVLNLIDGFDVLVMAFTASSVAAEWNLGGAEIGLLLSAGLFGMAAGSLFIAPWADRWGRRPLILACLALSGLGMLASALSQAAWQLALLRGLTGLGIGGILASSNVIASEYASRRWRGLAVSLQSTGYALGATLGGLLAVWLIGAWGWRSVFVFGAGLTLAVIPLVCLCLPESLDFLLARRPPRALQRVNALARRLGRAALAELPSPPGGGAEHGSRLARLLCVAQRRTTLLLWALFFLVMFGFYFIMSWTPKLLVAAGLSTAQGITGGTLLSIGGIFGAALLGGLAARFRLERVLALFMLLTAALLALFSLSAGLPGAALPLGLLIGLCANACVAGLYALAPSLYDASVRATGVGWGIGVGRGGAILSPLVAGLLLDDGWQPLSLYGAFAAVFVAAAAVLPLLGARRRERSPTLGDAA"
## [3] "MQKMNAPLQYRLDHADLALVLALVRGGSLARAAELLRVDVSTVFRAIRRLEKNLGKALFDKSRSGYLANQTARRLAQQAELAEQALEAARIGLELEREVLSGTVRLTCSDAVLHGLLLPALARFMPGYPALQLELATSNDFANLSRRDADIALRMTRTPPEHLVGRNLGEVRYVLCASERYLEGRDATEPAMLHWIAPDDFLPDHPSVVWRRRHYPGVLPAYRCNGVLAVSEMARAGLGLAAIPAYLLRDGDGLRVLDADLEGCASALWLLTRPDCRALRSVVALFDELGRHVRLGDE"                                                                                                                                                  
## [4] "MTRRHFLQRLGASAGLGAALTLGLEFGSPRGQAAAADHWHMPDEHLPQERVFLAYAASPSIWKDLAEDVNRSVALLARTIARYQPVTLLCRPTQEAAARRACGGANIDYLALPLDDIWVRDYGGCFVLNGEGEAGLVDFNFNGWGGKQQAGNDSRVAGVLSERLEVRYIASQLTGEGGGIEVDGRGTAILTESCWLNDNRNPGMDKGQVEAELKALLGLRKIVWLPGIRGRDITDAHVDFYARFVRPGVVIANLDNDPDSYDYAVTREHLEILRTATDADGRALQVHTLSPPLKPRRDYSRNNPDFAAGYINYLPINGAVIAPQFGDASADRHCGELLGRLYPGRDIVQLNIDPIAAGGGGIHCVTKHMPRA"                                                                        
## [5] "MALYSAGVEYGIHCLLFLADEKGDSRESSVRTLAELQGVPPELLAKVFTRLAKAGLVAATEGVRGGFRLARPANEISVLDVVRAIDGDKSIFECREVRERCAIFEGNPPSWATRNTCSIHAVMLTAQKRMEEALAQQTILDLVRRVGRIAPPEFGEEVLRWMDASREGRGGSRD"                                                                                                                                                                                                                                                                              
## [6] "MQIRADFDSGNIQVIDASDPRRIRLAIRPDLASQHFQWFHFKVEGMAPATEHRFTLVNAGQSAYSHAWSGYQAVASYDGERWFRVPSQYDADGLHFQLEPEESEVRFAYFKPYSRERHARLVERALGIEGVERLAVGTSVQGRDIELLRVRRHPDSHLKLWIIAQQHPGEHMAEWFMEGLIERLQRPDDTEMQRLLEKADLYLVPNMNPDGAFHGNLRTNAAGQDLNRAWLEPSAERSPEVWFVQQEMKHHGVDLFLDIHGDEEIPHVFAAGCEGNPGYTPRLERLEQRFREELMARGEFQIRHGYPRSAPGQANLALACNFVGQTYDCLAFTIEMPFKDHDDNPEPGTGWSGARSKRLGQDVLSTLAVLVDELR"

The first thing that really stands out here is the huge difference in lengths of the proteins. Let’s look at how sequence length is distributed.

targets <- targets %>% mutate(seq_length=nchar(Sequence))
targets %>% select(seq_length) %>% summary()
##    seq_length    
##  Min.   :  23.0  
##  1st Qu.: 213.0  
##  Median : 353.0  
##  Mean   : 434.6  
##  3rd Qu.: 556.0  
##  Max.   :5212.0
targets %>% ggplot(aes(x=seq_length)) +
  geom_histogram(aes(fill=Organism), bins=50) +
  theme(legend.position="bottom") + 
  labs(x="Sequence Length", y="Count", 
       title="Sequence Lengths of CAFA PI Targets")

It looks like we have more yeast proteins than bacteria proteins and that on average the yeast proteins are longer. I don’t know a ton about the biology of proteins, but this makes intuitive sense to me—bacteria are prokaryotes, and yeast are eukaryotes, so I would expect yeast to have more and more complex proteins.

Now let’s take a look at that tail of super long proteins, just to see if it matches what we’d expect.

targets %>% filter(seq_length > 2500) %>%
  ggplot(aes(x=seq_length)) +
  geom_histogram(aes(fill=Organism), bins=1000) +
  theme(legend.position="bottom") +
  labs(x="Sequence Length", "Count",
       title="The Tail")

Two things here are a little interesting / unexpected to me:

  • Our longest protein is from our bacteria, not our yeast. I’m not too freaked out about this, but we should take a look at that long bacteria protein just to make sure it’s OK.
  • Our yeast proteins seem to almost exclusively come in pairs. It seems really unusual to me that all of these really long proteins would have a different protein with exactly the same length.

OK, let’s take a look, starting with the long bacterial proteins.

targets %>% arrange(desc(seq_length)) %>%
  filter(Organism == "Pseudomonas aureginosa") %>%
  select(NCBI_Locus_Tag, seq_length) %>%
  head()
## # A tibble: 6 x 2
##   NCBI_Locus_Tag seq_length
##            <chr>      <int>
## 1     PA14_32790       5212
## 2     PA14_33610       5149
## 3     PA14_33280       4342
## 4     PA14_55400       4177
## 5     PA14_00510       3443
## 6     PA14_05390       2476

A quick Googling of the NCBI Locus Tags shows that these six longest bacteria sequences all have transposon mutants (for example, PA14_32790). Not positive what that means in this context, but a transposon is a chunk of DNA that can jump into other places in DNA, so maybe these sequences are so long because extra DNA has jumped into them? Or maybe “transposon mutants” means something else. Note: Look to see if shorter sequences have transposon mutants.

Now let’s look to see about duplicates in our yeast.

targets %>% arrange(desc(seq_length)) %>%
  filter(Organism == "Candida albicans") %>%
  select(NCBI_Locus_Tag, seq_length, Sequence) %>%
  # Trim sequences to the first 100 chars so they display nicely
  mutate(Sequence=str_c(str_sub(Sequence, 1, 100), "...")) %>%
  head()
## # A tibble: 6 x 3
##   NCBI_Locus_Tag seq_length                                                                                                Sequence
##            <chr>      <int>                                                                                                   <chr>
## 1    C4_00970C_A       5035 MNNHITISFNQGEELYSTYKSFYTLKKLPNQLFKFNKSLSIDEILNQLSLLALDHTLPTYYCYKPIFLELVARWVNNTIELESQYNKSKSETRRISGSVI...
## 2    C4_00970C_B       5035 MNNHITISFNQGEELYSTYKSFYTLKKLPNQLFKFNKSLSIDEILNQLSLLALDHTLPTYYCYKPIFLELIARWVNNSIELESQYNKSKSETRKIPGSII...
## 3    C3_05210C_B       4161 MEESVTPLLSVNQLYDCIIGFLPFNENVPKFTQCEDVLTKFISNNNQDTLYLIKSIEEKTSRISNDLSDLDMGFKDLDTFVIIIKSKGELRNDIPLTQQL...
## 4    C3_05210C_A       4161 MEESVTPLLSVNQLYDCIIGFLPFNENVPKFTQCEDVLTKFISNNNQDTLYLIKSIEEKTSRISNDLSDLDMGFKDLDTFVIIIKSKGELRNDIPLTQQL...
## 5    C4_06130W_A       3826 MSSKEKPSKEDIRGQLSDIQKRLFSKKPSSGDTSGATTSASTNSNHRPVLKESNHHPQGKSSLLSGGVGSHEQERFADFLPPLPKGSGNASASLGPPAPT...
## 6    C4_06130W_B       3826 MSSKEKPSKEDIRGQLSDIQKRLFSKKPSSGDTSGATTSASTNSNHRPVLKESNHHPQGKSSLLSGGVGSHEQERFADFLPPLPKGSGNASASLGPPAPT...

Looks like our identical lengths are from different alleles (slightly different copies of the same gene due to genetic variation). Now we know! It will be interesting to see whether our classifier predicts the same functions for two alleles of the same gene.