In addition, the FliH sequence from Salmonella and the FliH sequence from H. pylori were used as input to PSI-BLAST, and the sequences attaining e-values of less than 10^-3 after two iterations were downloaded. All
of these sequences were aggregated into a single set that will be denoted “set A”.

Filtering of FliH sequences

Redundancy in set A was reduced by using the EMBOSS [28] program needle to perform pairwise global alignments [29] between all possible pairs of sequences. That is, each sequence in set A was globally aligned with every other sequence, and the % identity between each pair of sequences was recorded. The gap opening penalty used in needle was 8, while the gap extension penalty was set to 0.5; all other settings were left at their default values. Using the % identity data for each pair in set A, a new set of proteins (“set B”) was derived such that no protein in the latter set was more than 25% identical to any other protein in that same set. The purpose of this was to eliminate as much as possible the phylogenetic signal, which could
potentially confound the statistical results. Set B was used to derive the data shown in Figures 4, 5, 7 and 8. For comparison purposes, a larger set of proteins was created; in this set, no protein was more than 90% identical to any other protein. Analysis of this set is shown in Additional files 3 and 4.

Note that the obvious method for deriving set B is simply to randomly delete one of the proteins whenever two proteins in set A are found to be more than 25% identical. However, this method may result in more proteins being deleted than necessary. Consider three proteins X, Y, and Z, where X and Y are both more than 25% identical to Z but are not more than 25% identical to each other (casual testing suggested that this does happen occasionally). Suppose that X is first compared to Z and found to be more than 25% identical, and X is arbitrarily chosen for deletion. Then Y is compared to Z, and one of these proteins is deleted. Now only one protein is left, despite the fact that only Z needed to be deleted in
order to satisfy the requirements of set B. To solve this problem and maximize the number of sequences left after filtering, the following algorithm was used: for each protein p in set A, a set ψ_p is maintained that contains all the other proteins that are more than 25% identical to p. The sequence M with the highest value of |ψ_M| is found, and M is then removed from set A; in addition, M is also deleted from every other protein’s ψ_p. This process is repeated until ψ_p = ∅ for all p.

To remove proteins that were unlikely to actually be FliH, the mean length μ of the sequences in set B was computed, as well as the standard deviation σ of these lengths. Protein sequences having a length outside the range μ ± 1.5σ were deleted.
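
The two filtering steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names (greedy_filter, length_filter), the dictionary of pairwise % identities keyed by name pairs, and the toy values are hypothetical and stand in for the output of the needle alignments. Two details are assumptions, since the text does not specify them: ties in |ψ_M| are broken by input order, and σ is computed as the sample standard deviation.

```python
from statistics import mean, stdev


def greedy_filter(names, identity, threshold=25.0):
    """Greedy redundancy filter.

    names:    list of protein identifiers (hypothetical input format).
    identity: dict mapping a pair (a, b) to its % identity; each
              unordered pair is expected to appear once.
    Returns the identifiers that remain after filtering.
    """
    # psi[p] holds every other protein more than `threshold` % identical to p.
    psi = {p: set() for p in names}
    for (a, b), pct in identity.items():
        if pct > threshold:
            psi[a].add(b)
            psi[b].add(a)

    while psi:
        # Protein with the largest psi set; ties go to the earliest in
        # `names` (an assumption, the text does not state a tie rule).
        m = max(psi, key=lambda p: len(psi[p]))
        if not psi[m]:
            break  # every psi set is empty, so no pair exceeds the threshold
        # Remove m, and delete it from the psi set of each of its neighbours.
        for q in psi.pop(m):
            psi[q].discard(m)
    return list(psi)


def length_filter(seqs, k=1.5):
    """Keep sequences whose length lies within mean +/- k * sd.

    seqs: dict mapping identifier -> sequence string (needs >= 2 entries).
    Uses the sample standard deviation (an assumption).
    """
    lengths = [len(s) for s in seqs.values()]
    mu = mean(lengths)
    sigma = stdev(lengths)
    low, high = mu - k * sigma, mu + k * sigma
    return {n: s for n, s in seqs.items() if low <= len(s) <= high}


# The X, Y, Z example from the text, with made-up identity values:
# X and Y each exceed 25% identity to Z but not to each other.
toy_identity = {("X", "Z"): 40.0, ("Y", "Z"): 35.0, ("X", "Y"): 10.0}
print(greedy_filter(["X", "Y", "Z"], toy_identity))  # prints ['X', 'Y']
```

In the toy example only Z is removed, whereas the random-deletion approach could leave a single protein.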