Other criteria for the blastx comparison were tested, but we observed no significant difference in the results after the subsequent filters. Candidates with some of their best hits in stramenopiles in addition to bacteria were also retained, since some HGTs may be shared between stramenopiles. Genes for which orthologs were identified in non-stramenopile species were discarded. The evolutionary origin of the candidate genes was then investigated using phylogenetic approaches. For each gene, homologues were retrieved from the protein nr database using blastp. The sequences were aligned using Muscle 3.6. The resulting alignments were visually inspected and manually refined using the MUST software. Ambiguously aligned regions were removed prior to phylogenetic analysis.
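The retention rule described above can be illustrated with a minimal sketch. It assumes that the best hits of each gene have already been labelled with a taxonomic group. The group labels, the function names and the requirement of at least one bacterial hit are illustrative assumptions, not the script actually used for this analysis.

```python
from typing import Dict, List

# Assumption: hits in these groups are compatible with a bacterial HGT
# that may be shared between stramenopiles.
ALLOWED_GROUPS = {"bacteria", "stramenopiles"}


def is_hgt_candidate(best_hit_groups: List[str]) -> bool:
    """Keep a gene if it has a bacterial best hit and no hit outside ALLOWED_GROUPS."""
    groups = set(best_hit_groups)
    return "bacteria" in groups and groups <= ALLOWED_GROUPS


def filter_candidates(hits_by_gene: Dict[str, List[str]]) -> List[str]:
    """Return the sorted identifiers of the genes that pass the filter."""
    return sorted(
        gene for gene, groups in hits_by_gene.items() if is_hgt_candidate(groups)
    )


# Hypothetical toy input: gene identifier -> taxonomic groups of its best hits.
example_hits = {
    "geneA": ["bacteria", "bacteria"],
    "geneB": ["bacteria", "stramenopiles"],
    "geneC": ["bacteria", "metazoa"],
    "geneD": ["stramenopiles"],
}

print(filter_candidates(example_hits))  # prints ['geneA', 'geneB']
```

In this toy input, geneC is rejected because one of its hits falls in a non-stramenopile eukaryotic group, and geneD because it has no bacterial hit.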

Maximum likelihood phylogenetic tree reconstructions were carried out on the remaining positions using PhyML with the Le and Gascuel (LG) model and a gamma correction to take into account evolutionary rate variation among sites. Tree robustness was estimated by a non-parametric bootstrap approach using PhyML with the same parameters and 100 replicates of the original dataset. Bayesian phylogenetic trees were also reconstructed using MrBayes version 3.1.2. We used a mixed model of amino acid substitution and a gamma distribution to take into account site rate variation. MrBayes was run with four chains for 1 million generations, and trees were sampled every 100 generations. To construct the consensus tree, the first 1,500 trees were discarded as burn-in. The candidates with clear eukaryotic origin were then discarded.
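The sampling figures given for the Bayesian analysis imply a burn-in fraction, which the following sketch works out. It considers a single run and ignores the starting tree at generation 0; the function name and the returned keys are illustrative assumptions.

```python
def burnin_summary(generations: int, sample_freq: int, burnin_trees: int) -> dict:
    """Summarise how many sampled trees are kept after discarding the burn-in."""
    sampled = generations // sample_freq  # trees sampled during the run
    return {
        "sampled_trees": sampled,
        "kept_trees": sampled - burnin_trees,
        "burnin_fraction": burnin_trees / sampled,
        "burnin_generations": burnin_trees * sample_freq,
    }


# Values stated in the text: 1 million generations, one tree sampled
# every 100 generations, first 1,500 trees discarded.
summary = burnin_summary(generations=1_000_000, sample_freq=100, burnin_trees=1_500)
print(summary)
```

Under these assumptions, 10,000 trees are sampled, the burn-in covers the first 150,000 generations (15% of the samples), and 8,500 trees remain for the consensus.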

This process provided 133 candidate genes. These candidates contain a high proportion of monoexonic genes compared to the average number of monoexonic genes in Blastocystis sp.

Protein domain analysis

InterProScan was run against all C. merolae, P. sojae, T. pseudonana and Blastocystis sp. proteins. Matches that fulfilled the following criteria were retained: (i) match tagged as true positive by InterProScan; (ii) match with an e-value ≤ 10^-1. A total of 2,305 InterPro domains were found in Blastocystis sp., corresponding to 4,096 proteins.

Functional annotation

Enzyme annotation

Enzyme detection in predicted Blastocystis sp. proteins was performed with PRIAM, using the PRIAM July 2006 Enzyme release. A total of 428 different EC numbers, corresponding to enzyme domains, are associated with 1,140 Blastocystis sp. proteins. Therefore, about 19% of Blastocystis sp. proteins contain at least one enzymatic domain.

Association of metabolic pathways with enzymes and Blastocystis sp.

Potential metabolic pathways were deduced from EC numbers using the KEGG pathway database. Links between EC numbers and metabolic pathways were obtained from the KEGG website. Using these links and the PRIAM results, 906 Blastocystis sp. proteins were assigned to 201 pathways.
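The assignment of proteins to pathways amounts to joining two mappings: protein to EC numbers (from PRIAM) and EC number to pathways (from KEGG). The sketch below shows one way this join could be done. The protein identifiers, the function name and the small mappings are toy values chosen for illustration, and "9.9.9.9" is a made-up EC number standing in for an enzyme without any pathway link.

```python
from typing import Dict, Iterable, Set


def assign_pathways(
    protein_to_ecs: Dict[str, Iterable[str]],
    ec_to_pathways: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each protein to the union of pathways linked to its EC numbers."""
    protein_to_pathways: Dict[str, Set[str]] = {}
    for protein, ecs in protein_to_ecs.items():
        pathways: Set[str] = set()
        for ec in ecs:
            # EC numbers without a pathway link contribute nothing.
            pathways.update(ec_to_pathways.get(ec, set()))
        if pathways:  # keep only proteins with at least one pathway
            protein_to_pathways[protein] = pathways
    return protein_to_pathways


# Toy enzyme annotation (hypothetical protein identifiers).
toy_protein_to_ecs = {
    "prot1": ["2.7.1.1"],
    "prot2": ["1.1.1.27", "2.7.1.1"],
    "prot3": ["9.9.9.9"],  # made-up EC number with no pathway link
}

# Toy subset of EC-to-pathway links, for illustration only.
toy_ec_to_pathways = {
    "2.7.1.1": {"map00010", "map00500"},
    "1.1.1.27": {"map00010", "map00620"},
}

assigned = assign_pathways(toy_protein_to_ecs, toy_ec_to_pathways)
all_pathways = set().union(*assigned.values())
print(len(assigned), "proteins assigned to", len(all_pathways), "pathways")
```

In this toy case, prot3 drops out because its EC number has no pathway link. The same effect would be consistent with fewer proteins being assigned to pathways (906) than carry an EC number (1,140).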

