Byonic User Manual v2.6 (Sept 2015)

Table of Contents

1 Overview
2 System Requirements
3 Input
  3.1 Digestion and Instrument Parameters
  3.2 Modifications
  3.3 Glycans
  3.4 S-S,Xlink
  3.5 Advanced Options
  3.6 Progress
4 Byonic Output
  4.1 Excel Output
  4.2 Viewer Output
  4.3 Output Field Descriptions
5 More on Modifications
6 False Discovery Rate
7 Appendix

 

1 Overview

Byonic is a software package for identifying peptides and proteins by tandem mass spectrometry.  Byonic plays the same role as Mascot, SEQUEST, and X!Tandem, but offers greater accuracy, sensitivity, and flexibility.  Byonic provides three major features not found in the other search engines:  Modification Fine Control™, Wildcard Search™, and glycopeptide search.

Modification Fine Control™ enables the user to search for 10s or even 100s of modification types at a time without a combinatorial explosion.  For example, a user might allow up to three phosphoserines, S[+80], per peptide, but allow at most one beta elimination, S[-18], and at most one deamidated asparagine, N[+1].  To further reduce the search, the user might allow at most one of either S[-18] or N[+1], that is, disallowing peptides containing one of each.  Modification fine control empowers the user to tailor the search to the sample, and thus avoid overly narrow searches that miss interesting peptides and overly broad searches that run for hours or days and produce “noisy” results with many false positives.

Wildcard Search™ enables the user to search for unanticipated, or even unknown, modifications.  A wildcard can modify any residue by any mass delta within a user-settable range.  Wildcard masses occur at roughly 1.0 Dalton spacing just like molecular masses.  There is a limit of one wildcard per peptide.

Glycopeptide search enables the user to search for glycosylated peptides, without prior knowledge of either glycosylation sites or glycan masses.  Byonic offers three ways to specify glycopeptide searches:  internal tables, external tables, and modification fine control.  Byonic’s internal tables contain the most likely N- and O-linked glycan compositions, but allow only one glycan per peptide.  The other two options allow the user to customize the list of glycans and/or allow more than one glycan per peptide.

Top-down, middle-down, and bottom-up proteomics – Byonic is uniquely capable of top-down, middle-down, and bottom-up searches, but it requires isotope-resolved precursors in order to determine precursor ion charges.

Disulfide bonds, trisulfide bonds, and general crosslinking – Byonic has disulfide bond, trisulfide bond, and general crosslink search capability.  It is designed to search for all disulfide pairs (both expected and shuffled disulfide bonds) and the user can constrain which protein chains to consider when doing the search.

New features in Byonic Version 2.6 include support for Thermo, Bruker, Sciex, Waters and Agilent native file formats, disulfide bond and general crosslink search capability, tables of predicted and observed ions for peptide-spectrum matches, and greater control of the naming of output folders for search results.

2 System Requirements for Standard License Deployment

Minimum requirement:        Recommended PC:
Windows 7 64-bit            Windows 10 64-bit
4 GB RAM                    16 GB RAM
500 GB disk space           1 TB disk space (solid-state drive)
One-core CPU                Many-core CPU (e.g., Intel Core i7-5930K @ 3.50 GHz)
Java 7 or higher            Java 7 or higher

 

Recommended PC for Licensed Deployments on High Performance PC for Extended Numbers of Cores (32+ cores)

Windows Server 2012

64 GB RAM

2 TB disk space

Two Xeon CPUs (recommend 8 to 16 physical cores)

Java 7 64-bit or higher

3 Input

Figure 1 shows an example Byonic search set-up in the graphical user interface (GUI).  In the top pane of Byonic’s control window, there are boxes for setting the two main inputs:  a spectrum data file in a standard format (MGF, mzML, mzXML, or Thermo RAW), and a protein database in FASTA format.  (Note Byonic can search data from all major vendors’ mass spectrometers.)  The protein database should contain both targets and decoys (recognized by protein names beginning >Reverse or >Decoy) for false discovery rate (FDR) estimation.  Byonic will automatically add decoys and contaminant proteins (e.g., trypsin, bovine serum albumin, and human keratins) if the appropriate boxes are checked.

An organized way to configure Byonic is to create input folders C:\data_input\Mass_Spectra and C:\data_input\Protein_Databases, and an output folder C:\data_results.  For each search, Byonic will create a reformatted spectrum file inside the Mass_Spectra directory with extension .byspec2, and a new folder inside the results folder (conventionally C:\data_results) with a long name including the time of the Byonic search and the name of the spectrum data file.  This new folder will contain the output in two different forms:  a Microsoft Excel spreadsheet with extension .xlsx and a Byonic results file with extension .byrslt, to be read by Byonic’s output Viewer.   The new folder will also contain a subfolder named objs, which will contain log files and the parameters file (extension .byparms) used to run Byonic.  The .byparms file contains all the provenance to reproduce the search.

Note that a new feature of Byonic v2.6 gives users additional control over the output folder and the name of each individual search results folder.  In Figure 1, just below the Input files section, there is a new section called Output folder.  The button on the left lets the user see and designate the output parent folder for search results.  (Due to a Windows DOS limitation, the total folder path needs to be less than 256 characters.)  To the right is a field for designating a result folder name.  Users can name the folder anything, but Byonic recognizes some default templates and substitutes for them in the result folder name to add detail; Byonic will fill in:

  • [spec] with the name of the mass spectrum data file,
  • [fasta] with the name of the fasta file,
  • [date] with the current date (year, month, day), and
  • [time] with the current time (hour, minute)
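The template substitution can be pictured with a short Python sketch.  This is only an illustration, not Byonic's own code:  the function name is hypothetical, and the exact date and time layouts (here YYYYMMDD and HHMM) and whether file extensions are kept are assumptions.

```python
from datetime import datetime

def expand_result_folder_name(template, spectrum_name, fasta_name, now):
    """Fill in the [spec], [fasta], [date], and [time] placeholders."""
    substitutions = {
        "[spec]": spectrum_name,
        "[fasta]": fasta_name,
        "[date]": now.strftime("%Y%m%d"),  # assumed layout: year, month, day
        "[time]": now.strftime("%H%M"),    # assumed layout: hour, minute
    }
    for placeholder, value in substitutions.items():
        template = template.replace(placeholder, value)
    return template

name = expand_result_folder_name(
    "[spec]_[fasta]_[date]_[time]", "sample01", "human",
    datetime(2015, 9, 1, 14, 30),
)
print(name)
```

Text outside the square-bracket placeholders is passed through unchanged, so a user-chosen prefix or suffix survives the substitution.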

In the bottom pane of the input GUI, there is a button to run the program, a pull-down menu that controls the number of computer cores of the CPU that Byonic will use, and checkboxes that determine what happens upon completion.  The middle pane is the one that specifies the search itself.  This pane has six tabs:  Digestion and Instrument Parameters; Modifications; Glycans; S-S, Xlink; Advanced; and Progress.


Figure 1.  Byonic input GUI with the Digestion and Instrument Parameters tab open.

3.1 Digestion and Instrument Parameters

As Figure 1 shows, the Digestion and Instrument Parameters tab allows the user to set the residues recognized by the digestion enzyme.  In this example, the enzyme is trypsin, so the user entered RK for arginine and lysine and chose C-terminal for the cleavage side.  If the user leaves the Cleavage sites box empty, the only specific cleavage sites are protein termini.   Here the user chose Fully specific search, meaning that both the N- and C-terminal cleavages must be C-terminal to R or K.  Byonic also supports nonspecific cleavage at either or both endpoints.  (A nonspecific search with RK in the Cleavage sites box searches all peptides but favors tryptic peptides; the user must leave the Cleavage sites box empty for a true no-enzyme search.)  The user selected 2 Missed cleavages, which limits the number of internal Rs and Ks not followed by P to at most 2; leaving Missed cleavages at its default value of -1 allows any number of internal Rs and Ks.
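As a rough illustration of a fully specific search, the following Python sketch enumerates peptides for C-terminal cleavage at R or K, with no cleavage when P follows, and with a missed-cleavage limit in which -1 means unlimited.  The function name and its arguments are hypothetical, and the sketch makes no claim about how Byonic itself performs digestion.

```python
def digest(protein, cleavage_residues="RK", max_missed=2):
    """Fully specific C-terminal digestion; no cleavage before proline."""
    # Cut points lie after each cleavage residue unless P follows.
    cuts = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in cleavage_residues and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))

    peptides = []
    for start in range(len(cuts) - 1):
        for end in range(start + 1, len(cuts)):
            missed = end - start - 1  # internal cleavage sites left uncut
            if max_missed >= 0 and missed > max_missed:
                break
            peptides.append(protein[cuts[start]:cuts[end]])
    return peptides

print(digest("AKRPGKC", max_missed=1))
```

In the toy sequence AKRPGKC, the R is followed by P and therefore is not treated as a cleavage site, so RPGK stays intact even with zero missed cleavages.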

The user chose 10 ppm precursor mass tolerance, 40 ppm fragment mass tolerance, and QTOF/HCD fragmentation.  Byonic supports both Dalton and ppm mass tolerances for both precursors and fragments, and supports CID, TOF-TOF, QTOF HCD, ETD/ECD, and EThcD fragmentation types.  The Dalton tolerance applies to measured mass for precursors but measured m/z for fragments.
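The difference between the two tolerance units can be sketched in a few lines of Python.  This is a generic illustration with a hypothetical function name; it applies equally to a precursor mass or a fragment m/z value and is not taken from Byonic.

```python
def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """Return True if observed matches theoretical within the tolerance."""
    if unit == "ppm":
        # A ppm tolerance scales with the theoretical value.
        allowed = theoretical * tolerance * 1e-6
    elif unit == "Da":
        # A Dalton tolerance is a fixed absolute window.
        allowed = tolerance
    else:
        raise ValueError("unit must be 'ppm' or 'Da'")
    return abs(observed - theoretical) <= allowed

# 10 ppm at 1000 corresponds to a window of +/- 0.01.
print(within_tolerance(1000.009, 1000.0, 10))
print(within_tolerance(1000.011, 1000.0, 10))
```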

 


Figure 2.  Modifications are shown in a text box.  Press Enter/Edit to pop up the menu in Figure 3 and change the modification settings.

3.2 Modifications

The Modifications tab, shown in Figure 2, is where the user finds Modification Fine Control and Wildcard Search.  Like most proteomics search engines, Byonic supports two types of modifications:  fixed and variable.  A fixed modification is assumed to occur on all the residues of that type, but a variable modification is optional, so that each site for a variable modification is considered with and without the modification.

Byonic also offers a unique feature not found in other search engines:  the user designates each variable modification as either “common” or “rare”, with the names suggesting their use.  Byonic has separate limits on the number of occurrences of each variable modification, so that “common2” means at most two occurrences per peptide.  Byonic also has separate limits on the total number of common and rare modifications per peptide.  A typical search allows a total of at most two common modifications and a total of at most one rare modification per peptide.  To search for, say, three phosphoserines per peptide, the user can change Total common modification max to 3 or split phosphorylated serine between two rules:  common2  and  rare1.  Depending upon the other modification rules, the latter approach may give a faster search.

In Figure 2, the user specified Carbamidomethyl / +57.021464 @ C | fixed, meaning carbamidomethylated cysteine (camC).  The user also specified  Oxidation / +15.994915 @ M | common2, directing the program to consider each methionine residue with and without this modification, up to a limit of 2 such modifications per peptide.  In addition, the user specified Ammonia-loss / -17.026549 @ N-term C | rare1, indicating that the program also considers ammonia loss from any N-terminal camC as a rare variable modification.  Variable modifications are added on top of fixed modifications.  One way to represent incomplete carbamidomethylation is with these two rules:  Carbamidomethyl / +57.021464 @ C | fixed and (De)Carbamidomethyl / -57.021464 @ C | common2.

The rule Carbamidomethyl / +57.021464 @ NTerm | rare1 specifies a common artifact (over-alkylation) on the peptide N-terminus.

The next two rules, +0.984016 @ N | common2  and  +0.984016 @ Q | common1, represent deamidation; here the user is allowing up to two deamidated asparagines (the more common deamidation) but only one deamidated glutamine per peptide.  The rule

Gln->pyro-Glu / -17.026549 @ NTerm Q | rare1

specifies a modification that occurs only on peptides with N-terminal glutamine.

Conceptually, Byonic has one modification “slot” for each residue, along with slots for the peptide’s N- and C-termini.  A variable modification such as +0.984016 @ N uses up the residue slot; a nonspecific terminal modification such as +57.021464 @ NTerm uses up the terminal slot; but residue-specific N-terminal modifications, such as -17.026549 @ NTerm Q, use up both the residue and the N-terminal slots.

A variable modification on top of a fixed modification is specified as extra mass, so these two rules

Carbamidomethyl / +57.0215 @ C | fixed
Propionamide / +14.016 @ C | common1

allow Cys to be either C[+57.0215] or C[+71.0375] (a gel artifact). Separate variable mods, however, do not sum, so these two rules

Carbamidomethyl / +57.0215 @ C | common1
Propionamide / +71.0375 @ C | common1

allow Cys to be C[+0] (unmodified), C[+57.0215], or C[+71.0375].
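The two cases above can be reproduced with a small Python sketch.  It models only the stated behavior (a variable modification adds to a fixed one, while separate variable modifications on the same residue do not sum); the function name is hypothetical.

```python
def residue_mass_deltas(fixed, variable):
    """Return the sorted total mass deltas a residue may carry.

    fixed: mass delta that is always applied (0.0 if there is none).
    variable: variable mass deltas; at most one applies to the residue.
    """
    deltas = {round(fixed, 4)}                  # no variable modification
    for delta in variable:
        deltas.add(round(fixed + delta, 4))     # exactly one variable modification
    return sorted(deltas)

# Fixed +57.0215 with a variable +14.016 on top of it.
print(residue_mass_deltas(57.0215, [14.016]))
# Two separate variable modifications and no fixed modification.
print(residue_mass_deltas(0.0, [57.0215, 71.0375]))
```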

The user enters modifications by pressing Enter/edit, circled in red in Figure 2, and obtaining a pop-up window.  The user can then specify any number of modification rules via a pull-down menu containing all the modifications listed in www.unimod.org, as shown in Figure 3.  For convenience, frequently used modifications are listed twice, at the top and again in the complete list.  The three pull-down menus in each row select modification type, target residues, and fine control.  There is a fourth pull-down, circled in red in Figure 3, which lets the user delete, invert (as in (De)Carbamidomethyl), or add “attributes” to modifications.  Attributes allow the user to define protein-specific modifications.

 


Figure 3.  The modifications pop-up contains a pull-down menu containing all of Unimod.

The big open box in Figure 3 is a space for the user to type in custom modifications not listed in Unimod.  Byonic’s fine control format has the form:

Modification_Name / Mass_Delta @ Targets | Fine_Control

Modification_Name / is optional.  The Targets field allows the 20 one-letter amino acid abbreviations, as well as four special locations:  NTerm, CTerm, Protein NTerm, and Protein CTerm.  NTerm, CTerm, Protein NTerm, and Protein CTerm can also be used as modifiers of amino acid residues.  Targets form a comma-separated list.  Here’s an example of a real modification not (yet) in Unimod:

DehydroFormyl / +9.98435 @ NTerm S, NTerm T | rare1
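For readers who generate or check rule lists with scripts, here is a minimal Python sketch of a parser for the fine control format.  The regular expression and the function name are assumptions written for this illustration; Byonic's own parser may accept variations that this sketch rejects, and the sketch knows only the fixed, commonN, and rareN settings mentioned in this Manual.

```python
import re

RULE = re.compile(
    r"^(?:(?P<name>[^/@|]+?)\s*/\s*)?"        # optional "Modification_Name /"
    r"(?P<delta>[+-]?\d+(?:\.\d+)?)\s*@\s*"   # Mass_Delta
    r"(?P<targets>[^|]+?)\s*\|\s*"            # comma-separated Targets
    r"(?P<control>fixed|common\d+|rare\d+)\s*$"
)

def parse_rule(text):
    """Split one modification rule into its four fields."""
    match = RULE.match(text.strip())
    if match is None:
        raise ValueError(f"not a fine control rule: {text!r}")
    control = match["control"]
    kind = control.rstrip("0123456789")       # "fixed", "common", or "rare"
    limit = None if kind == "fixed" else int(control[len(kind):])
    return {
        "name": match["name"],                # None when the name is omitted
        "delta": float(match["delta"]),
        "targets": [t.strip() for t in match["targets"].split(",")],
        "kind": kind,
        "limit": limit,
    }

rule = parse_rule("DehydroFormyl / +9.98435 @ NTerm S, NTerm T | rare1")
print(rule)
```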

Note – for comprehensive sequence variant searches, it is more convenient to paste in a list of sequence variant modifications here than to add the 380 potential sequence variant substitutions via the drop-down menus.  Such lists are available from Protein Metrics by contacting support@proteinmetrics.com.

For backwards compatibility with initial versions of the program, Byonic still accepts the syntax from earlier versions:

[Residues][Mass_Delta], Fine_Control

Examples of the earlier syntax are  [ST][+79.966], common2  for phosphorylation and N-terminal S[+9.984], rare1 for DehydroFormyl.

The rightmost box in the middle pane of the Modifications tab (see Figure 2) controls Wildcard Search™.  This box lets the user turn on wildcard search, set the range for the wildcard mass, and restrict the wildcard to certain residues if desired.  In the Restrict to residues box, the 20 single-letter residue abbreviations have their usual meanings, and (lower case) n denotes peptide N-terminus and (lower case) c denotes peptide C-terminus.   A wildcard, even one with a mass range of only 50 or 60 Da, greatly increases the size of the search, so it is best used with a focused database (see the section on the Advanced tab below) and used either alone or with only a few other modifications enabled.  Most wildcard mass shifts will be recognizable by an expert; hence, a wildcard can be used to discover which known modifications should be enabled in a subsequent search.  In the pictured example, the user did not use a wildcard.   For more details about the Wildcard Search, see our Application Note.

By specifying most modifications as rare, it is quite feasible to search for 10 – 20 modification types at once with Byonic.  Even larger searches are possible with focused protein databases, for example with therapeutic proteins.  Such a focused database easily allows efficient mutation searches with 200+ possible substitutions or oxidative footprinting searches with 50+ types of oxidations.  Glycans and wildcards can easily enlarge the search space by 2 – 3 orders of magnitude, so these options should be used with care, and in conjunction with only the most common variable modifications such as oxidized methionine or pyro-Glu N-terminus.  NOTE:  The single most important factor in search time is Total common max.  Roughly speaking, the search time grows as C^T (C raised to the power T), where C is the number of common modifications enabled and T is Total common max.
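To get a feel for why Total common max dominates, the following Python sketch counts modified forms of one peptide under a deliberately simplified model:  every common modification type is assumed to fit every one of the peptide's sites, and per-modification limits are ignored.  This is not Byonic's algorithm, only an illustration of exponential growth in Total common max.

```python
from math import comb

def modified_forms(sites, common_types, total_common_max):
    """Count ways to place at most total_common_max modifications on a
    peptide with `sites` modifiable positions and `common_types` types."""
    return sum(
        comb(sites, k) * common_types ** k   # choose k sites, then a type for each
        for k in range(total_common_max + 1)
    )

for total_max in (1, 2, 3):
    print(total_max, modified_forms(sites=10, common_types=5, total_common_max=total_max))
```

With 10 sites and 5 modification types, raising the limit from 2 to 3 increases the count from 1176 to 16176 forms in this model.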

The Appendix of this Manual provides examples of frequently found modifications and appropriate syntax for including those modifications in a Byonic search.

3.3 Glycans


Figure 4.  Byonic offers three ways to define glycan modifications:  internal preset tables, external glycan databases, and user-defined glycans.

The next tab after the Modifications tab is the Glycans tab.  Clicking on the Enter/Edit button in this tab pops up a window labeled Select Glycans as seen in Figure 4.  The top part of the window (circled in red) allows the user to input a set of glycans all at once.  The user chooses Glycan type (N- or O-linked), browses for a text file of glycan compositions, and then sets Fine Control (rare1, common2, etc.).  The text file gives one glycan composition per line; for example, the following gives five of the most common human O-glycans.   Spaces between monosaccharides are optional, and unused monosaccharides can be left out or included with zero (0) occurrences.

HexNAc(1) Hex(0)

HexNAc(1) Hex(1) Fuc(0) NeuAc(0)

HexNAc(1)Hex(1)Fuc(0)NeuAc(1)NeuGc(0)

HexNAc(1)Hex(1)Fuc(0)NeuAc(2)NeuGc(0)

HexNAc(1)Hex(1)Fuc(1)NeuAc(0)NeuGc(0)
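A composition line in this format can be read and converted to a mass with a few lines of Python.  The sketch below is an illustration only:  the monoisotopic residue masses are standard reference values rather than values taken from this Manual, and the token Pent for pentose is an assumed spelling.

```python
import re

# Monoisotopic residue masses in Da (standard values; "Pent" is an assumed token).
MONOSACCHARIDE_MASS = {
    "HexNAc": 203.079373,
    "Hex": 162.052824,
    "Fuc": 146.057909,
    "Pent": 132.042259,
    "NeuAc": 291.095417,
    "NeuGc": 307.090331,
}

TOKEN = re.compile(r"([A-Za-z]+)\((\d+)\)")

def parse_composition(line):
    """Read e.g. 'HexNAc(1) Hex(1)' into a dict, dropping zero counts."""
    return {name: int(n) for name, n in TOKEN.findall(line) if int(n) > 0}

def glycan_mass(composition):
    """Sum the residue masses of a parsed composition."""
    return sum(MONOSACCHARIDE_MASS[name] * n for name, n in composition.items())

composition = parse_composition("HexNAc(1)Hex(1)Fuc(0)NeuAc(1)NeuGc(0)")
print(composition, round(glycan_mass(composition), 6))
```

Because spaces are optional and zero counts are dropped, the lines HexNAc(1) Hex(0) and HexNAc(1) parse to the same composition.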

The bottom part of Select Glycans allows the user to input glycans one at a time by specifying monosaccharide compositions.  Byonic allows six monosaccharide residues:  HexNAc, Hexose, Fucose, Pentose (common in plants), NeuAc, and NeuGc (common in non-humans).  Byonic also has a box for Sodium because this is a common adduct on sialic acids.  Other glycan masses and modifications such as sulfation and acetylation can be defined with the Additional mass box; this mass is added to the mass of the monosaccharides.  For some helpful examples and best practices for conducting N-linked and O-linked glycan searches, see our Application Notes.

3.4 S-S,Xlink

The fourth tab is the S-S, Xlink tab, new in version 2.6.  This tab allows the user to search for disulfide-bonded peptide pairs, trisulfide-bonded (also called persulfide-bonded) pairs, and more general cross-linking, and it provides options for searching for both expected and unexpected disulfide bonds.  When the user checks the checkbox beside Disulfide on the left-hand side and designates numerically which protein sequences from the FASTA database to consider (in the box below, boxed in red in Figure X), Byonic does an in silico digestion based on the digestion parameters designated earlier, considers every peptide that contains a cysteine, and looks to pair it with other cysteine-containing peptides.  The separators between the numbers in the “For FASTA proteins” field indicate how Byonic should consider the potential pairings.  For example, “1” searches for crosslinks in the first protein only, whereas “4,5; 7” searches for all potential crosslinks on the 4th, 5th, and 7th proteins, where #4 and #5 may crosslink to each other but not with #7.

The Trisulfide option allows a user to search for trisulfides both within a single peptide and between two peptides.  Similarly, the Crosslink: DSS and Crosslink: Custom options allow a user to search for crosslinks both within a single peptide and between two peptides.
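One way to read the “For FASTA proteins” field is that semicolons separate groups and commas separate the proteins within a group.  The Python sketch below lists the protein pairs allowed under that reading; the function name is hypothetical and the interpretation is inferred from the “4,5; 7” example only.

```python
from itertools import combinations_with_replacement

def allowed_protein_pairs(field):
    """Parse e.g. '4,5; 7' into the set of protein pairs that may crosslink."""
    pairs = set()
    for group in field.split(";"):
        proteins = sorted(int(p) for p in group.split(",") if p.strip())
        # Proteins in one group may link to themselves and to each other.
        pairs.update(combinations_with_replacement(proteins, 2))
    return pairs

print(sorted(allowed_protein_pairs("4,5; 7")))
```

For “4,5; 7” this yields the pairs (4, 4), (4, 5), (5, 5), and (7, 7), so protein 7 pairs only with itself.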

The button “Generate details” allows a user to show all the linked peptides and mass deltas that Byonic will search for; to fine-tune the search, the user may edit the text.

Byonic_FigS-S

3.5 Advanced Options


Figure 5.  GUI view with the Advanced tab open.

Figure 5 shows the fifth tab, simply labeled Advanced.   This tab helps Byonic cope with imperfect inputs.   For example, on many MS instruments precursor ion charges are uncertain for some or all spectra.  By default, Byonic will use the assigned charge for all spectra with assigned charges and use +1, +2, +3 for all CID spectra and +2, +3, +4 for all ETD spectra without assigned charges.   The Apply charges box allows the user to override this default setting.  If the check box is checked, then Byonic will apply all comma-separated charges in the box to each spectrum, +2, +3, +4 in the Figure.
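The charge-selection logic described above amounts to a three-way decision, sketched here in Python.  The function and dictionary names are hypothetical, and only the CID and ETD defaults stated in this section are included.

```python
# Default charges for spectra without assigned charges, as stated above.
DEFAULT_CHARGES = {"CID": [1, 2, 3], "ETD": [2, 3, 4]}

def candidate_charges(assigned_charge, fragmentation, apply_charges=None):
    """Return the precursor charges to try for one spectrum."""
    if apply_charges:                # Apply charges box checked: override everything
        return list(apply_charges)
    if assigned_charge is not None:  # trust the charge assigned in the data file
        return [assigned_charge]
    return list(DEFAULT_CHARGES[fragmentation])

print(candidate_charges(None, "CID"))
print(candidate_charges(2, "ETD"))
print(candidate_charges(2, "CID", apply_charges=[2, 3, 4]))
```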

Similarly, on many instruments the nominal precursor mass may actually be the mass of a 13C isotope peak rather than of the base (all 12C monoisotopic) peak.  For a nominal precursor mass of 2351.123 Da, for example, the true precursor mass may be within 10 ppm of 2350.120 Da or within 10 ppm of 2351.123 Da.   Precursor isotope off by x is a pull-down menu with three options:  No error check, which will use only the assigned precursor; Off by one or two, the default, which will allow the assigned precursor to be up to 2 Da too high; and Off by one or more, which will allow the assigned precursor to be up to n Da too high for a precursor of mass at least 1000n Da.
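The three menu options can be pictured as generating a short list of candidate monoisotopic masses.  In the Python sketch below, the isotope spacing of 1.00335 Da (the approximate 13C minus 12C mass difference), the mode names, and the function name are assumptions; Byonic's exact spacing and rounding are not documented here.

```python
ISOTOPE_SPACING = 1.00335  # Da; approximate 13C - 12C difference (assumed)

def candidate_precursor_masses(assigned_mass, mode="one_or_two"):
    """List the masses to try when the assigned precursor may be too high."""
    if mode == "none":               # No error check
        max_offset = 0
    elif mode == "one_or_two":       # Off by one or two (the default)
        max_offset = 2
    elif mode == "one_or_more":      # up to n too high for mass >= 1000n Da
        max_offset = int(assigned_mass // 1000)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [assigned_mass - k * ISOTOPE_SPACING for k in range(max_offset + 1)]

print([round(m, 3) for m in candidate_precursor_masses(2351.123)])
```

For the 2351.123 Da example, the second candidate is about 2350.120 Da, matching the pair of masses quoted above.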

Byonic can also calculate the precursor and charge assignments directly from the MS1 data or use the originally assigned values.  Byonic is now also able to consider multiple precursors per scan.  The middle pane of the Advanced tab offers options for filtering the peptide-spectrum matches (PSMs) by score.  By default, Byonic defers PSM filtering until after protein ranking, and then filters to control PSM FDR on the “true” proteins—those ranked above the top-ranking decoy protein.  This method gains sensitivity while simultaneously reducing both protein and PSM FDRs.  (Two-dimensional target decoy strategy for shotgun proteomics, Journal of Proteome Research 10 (12), 5296-5301, 2011.)  To filter PSMs before protein ranking, the user can uncheck the Automatic score cut box and type in a minimum Byonic score.  For example, a score threshold of 200 will remove weak matches and a threshold of 400 will remove all but the best matches.  Filtering by score may be helpful in special cases, for example to eliminate from consideration all but the best wildcard PSMs.

In the rightmost pane of the Advanced tab there is a checkbox labeled “Create focused database”.  Checking this box directs Byonic to output a new FASTA file (labeled focused and appearing in the output objs directory) containing only the proteins found in the search, along with suitable decoys (>Reverse) for unbiased FDR estimation.  The focused database can then be used for subsequent wide searches, including more modifications and/or a wildcard.  Of course, the user can also create focused databases outside of Byonic by editing existing FASTA files.  The rightmost pane also gives the user control of the protein list cut-off.  By default, Byonic cuts the protein list at 1% protein FDR or 20 decoy proteins, whichever comes last, but the user can ask for 2% protein FDR or a completely unfiltered (but still ranked) protein list.

Due to Byonic’s many options and capabilities, writing modification rules and setting parameters can be a nontrivial task.  For this reason, Byonic allows the user to save all inputs using the button labeled Save parameters and load a previously saved input, which can then be edited, using the button labeled Load parameters.  Reset parameters blanks out the modification rules and restores defaults.

3.6 Progress

The final tab is the Progress tab, which is also new for version 2.6.  When the program is launched and running, this tab will show the progress bar of the current search.

 

4 Byonic Output

Byonic writes its output to a results file (with filename extension .byrslt) that can then be viewed and explored interactively in Byonic’s output Viewer, a separate program from the Byonic search engine.  Byonic also writes the output data to an Excel spreadsheet (.xlsx) for viewing, sorting, importing into other programs, and sharing with collaborators.  In both output formats, Byonic organizes its findings into two lists, one for proteins and one for PSMs (peptide-spectrum matches).

4.1 Excel Output

Figure 6 shows Sheet 1 (Summary) of the Excel output, which gives critical statistics such as numbers of proteins, peptides, and spectra.  The two other Excel sheets give the Protein view and the Spectrum (or more precisely, PSM) view.  Although most of the information is the same between the Excel output and the Viewer, unique to the Excel spreadsheet is certain information on the Summary page, including some informative statistics, the search parameters, and two figures—Protein Score Plot and Precursor Mass Errors.


Figure 6.  Byonic output (Summary worksheet) as viewed in Excel.

 

4.2 Viewer Output

Figure 7 shows Byonic’s Viewer, which includes four interactive views:  (1) Protein List in the upper left pane, (2) Protein Coverage map for the selected protein in the lower left, (3) Peptide List (PSMs) for the selected protein(s), and (4) Annotated Spectrum for the selected PSM.  These views are interconnected:  changing the selected protein in view (1) changes views (2) and (3), and changing the selected PSM in (2) or (3) changes the annotated spectrum in view (4).  The user can select all proteins at once with the checkbox at the top left, next to Prot Rank; this then fills view (3) with all PSMs.  The four views are dockable and rearrangeable for a customized screen layout, which is especially useful for double-headed displays.  Figure 8 shows the Spectrum view undocked (detached from the other panes) and resized.  A selection under Window on the top bar restores the default layout.

The user can rearrange columns in the protein and peptide lists, as well as hide/show columns and adjust their widths for optimum viewing.  To hide/show columns, right-click on the headings bar to open the Header Editor.  Mousing over column headings and icons brings up Tool Tips.  The protein and peptide lists can be sorted by any column value by clicking on the column header, and the lists can be filtered using the text boxes on the top bars.  For example, to find all phosphopeptides, filter for peptides containing the string [+79.966].  The Viewer layout persists upon exit, and customized layouts can be saved (under Window) as small files with the suffix .ini.

The peptide list includes a large number of possible columns.  The most important ones are the peptide sequence, Byonic score and log prob, and scan number (most often found in the comment column).  In the case of an identification that includes a glycan from the predefined glycan tables, the glycan’s monosaccharide composition is given by a sequence of 6 numbers, the numbers of HexNAc, Hex, Fuc, NeuAc, NeuGc, and sodium.  The column header NHFAGNa serves as a mnemonic for the composition code.   The various numbers in the peptide list columns are explained in more detail below.

The annotated spectrum view allows the user to inspect identified peaks and associated fragment errors for manual validation of identifications and modification placements.  Byonic annotates the most commonly observed ion series (a, b, y, c, z, b++, y++, etc.) for the type of fragmentation, along with common neutral losses from the precursor (e.g., M–98 for loss of phosphoric acid), and charged losses from glycans.  One of Byonic’s annotations is non-standard:  ~y7 means y7 with loss of labile modifications (e.g., O-glycans).

The spectrum view includes a vertical-line cursor, which is movable by the mouse and allows the user to line up annotated peaks with their associated m/z errors.  When the cursor is positioned exactly over a spectrum peak, the reported m/z and intensity of that peak are shown inside curly braces {  ,  }.  When the cursor is not directly over a peak, these numbers give the m/z and intensity (x, y coordinates) of the cursor position.  The spectrum view also presents a cleavage diagram for the amino acid sequence, which is a standard summarization of the evidence for the identification.  There are buttons that turn on/off the peak and cleavage diagram annotations.  Finally, there are buttons for zooming in (magnifying glass with a +), zooming out (magnifying glass with a -), panning (rosette of compass directions), and a reset button (magnifying glass with 1:1).

 


Figure 7.  Byonic output as viewed in Byonic’s interactive Viewer.

 

There is a toggle button in the upper left corner of the Spectrum view in Figure 7 that shows or hides the Ion Table.  When the Ion Table is shown, the observed values are shown in red by default.  The user can optionally expand the table to show observed and Delta masses in separate columns.  There is an export button that allows an analyst to export the table to the clipboard.


Figure 8.  Byonic’s Viewer with the Spectrum view (MS2 plot and fragment errors) undocked.

 

 

4.3 Output Field Descriptions

The Excel spreadsheet and the interactive Viewer are two ways of exploring the same content.  The following descriptions apply to both forms of output.

Protein List.  Byonic outputs a protein list ranked by the protein p-value.  The protein p-value is the likelihood of the PSMs to this protein arising by random chance, according to a simple probabilistic model.  For example, a protein p-value of 10^-3 = 0.001, or one chance in a thousand, means that in a search against a database containing 10,000 independent proteins, we expect to see only about ten p-values better than 0.001 arising at random.  Byonic’s p-values are only as accurate as the probabilistic model, however, so the user should also check the ranking of the proteins relative to the decoy (>Reverse) proteins.  Byonic actually outputs the absolute value of the logarithm base 10 of the p-value.  Logarithms are useful to prevent numerical underflow, and the absolute value offers compatibility with follow-on tools that expect a larger-is-better score.  Byonic’s protein list includes a number of columns that reflect the quality of the PSMs for each protein:

|Log Prob| – Absolute value of the log base 10 of the protein p-value

Best |Log Prob| – Largest |Log Prob| of an individual PSM assigned to the protein

Best Score – Largest Byonic score of a PSM assigned to the protein.  For “one-peptide-per-protein” samples, for example top-down proteomics or N-glycopeptide samples, sorting by Best Score may give a better protein list (lower-ranking decoys) than protein |Log Prob|.

Total Intensity – Sum of all fragment peak intensities over all MS/MS spectra

# of spectra – Total number of PSMs, including duplicate PSMs

# of unique peptides – Total number of PSMs, discounting duplicates.  (The same modification differently placed counts as a distinct PSM.)

Coverage % – Percent of the protein sequence covered by PSMs
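The arithmetic behind the protein p-value example and the |Log Prob| column can be checked with a short Python sketch.  The function names are hypothetical; the sketch only restates the definitions given above and does not reproduce Byonic's probabilistic model.

```python
import math

def abs_log_prob(p_value):
    """|Log Prob|: absolute value of the base-10 logarithm of a p-value."""
    return abs(math.log10(p_value))

def expected_chance_hits(p_value, num_proteins):
    """Expected number of independent proteins reaching this p-value by chance."""
    return p_value * num_proteins

print(abs_log_prob(0.001))                 # |Log Prob| for a p-value of 0.001
print(expected_chance_hits(0.001, 10000))  # chance hits in a 10,000-protein database
```

Because the absolute value is taken, a smaller p-value gives a larger |Log Prob|, which is the larger-is-better convention mentioned above.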

Byonic lumps proteins P1, P2, … into a protein group if exactly the same spectra match all the proteins in the group.  An “ambiguous” PSM, meaning one matching a peptide that is found in two or more proteins, is always assigned to the higher-ranking of two proteins it matches.  Thus, if P1 has separate evidence but P2 does not, P2 will not be shown, and if P1 has a lot of separate evidence but P2 has only a little, then P1 will be ranked according to all its evidence, but P2 will be ranked according to only its separate evidence.  For this reason, as well as many other reasons, none of Byonic’s outputs (# of spectra, total intensity, etc.) is an accurate measure of protein abundance.

Peptide-Spectrum Match List.  As initially presented, Byonic’s PSM list is organized by protein, and left to right (that is N- to C-terminus) by starting position within proteins.  The user can also sort the list by other columns, for example, in order to see how the scores of phosphopeptides or glycopeptides compare to the scores of decoy peptides, or to see how PSM scores vary across all proteins.  The PSM list includes the following columns, as well as some others that require no explanation:

Off-by-x Error – [M_Observed – M_Computed], where M_Observed is the observed M+H (singly charged) precursor mass and M_Computed is the computed M+H precursor mass, and [ ] means closest integer.

Mass Error (ppm) – 10^6 x (M_Observed – M_Computed) / M_Computed.  The ppm mass error is computed after correcting for off-by-x errors.

Starting Position – Position within the protein of the N-terminal residue of the peptide.

Cleavage – Digestion specificity, where Specific means fully specific, Nragged means nonspecific at the N-terminus, Cragged means nonspecific at the C-terminus, and Non means nonspecific at both termini.

Score – Byonic score, the “raw” indicator of PSM correctness.  Byonic scores reflect the absolute quality of the peptide-spectrum match, not the relative quality compared to other candidate peptides.  Byonic scores range from 0 to about 1000, with 300 a good score, 400 a very good score, and PSMs with scores over 500 almost sure to be correct.

Delta – The drop in Byonic score from the top-scoring peptide to the next distinct peptide.  In this computation, the same peptide with different modifications is not considered distinct.

DeltaMod – The drop in Byonic score from the top-scoring peptide to the next peptide different in any way, including placement of modifications.  DeltaMod gives an indication of whether modifications are confidently localized; DeltaMod over 10.0 means that there is a high likelihood that all modification placements are correct.

|Log Prob| – The absolute value of the log10 of the posterior error probability (PEP).  The PEP takes into account the Byonic score, delta, precursor mass error, digestion specificity, and so forth (10 features in all).  For PSMs with non-negligible error probabilities, say error probabilities > 0.0001, and hence |Log Prob| < 4.0, PEPs are in good agreement with “local FDR” measured by the decoy sequences.  |Log Prob| incorporates more evidence than Byonic scores, so that sorting by |Log Prob| will almost always give a better ROC curve than sorting by score.
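The relation between PEP and |Log Prob| is a one-line conversion.  The sketch below (with a hypothetical function name) shows it for illustration only.

```python
import math


def abs_log_prob(pep):
    """Return |log10(PEP)| for a posterior error probability in (0, 1]."""
    if not 0.0 < pep <= 1.0:
        raise ValueError("PEP must be in (0, 1]")
    return abs(math.log10(pep))


if __name__ == "__main__":
    for pep in (0.05, 0.01, 0.0001):
        print(pep, abs_log_prob(pep))
```

A PEP of 0.0001 corresponds to a |Log Prob| of 4.0, the boundary mentioned above; larger error probabilities give smaller |Log Prob| values.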

PEP 1D – PEP stands for “posterior error probability”.  Byonic collects statistics for “true” and “false” PSMs: score, delta, precursor mass error, digestion specificity, and so forth (10 features in all).  “True” PSMs are defined to be high-scoring PSMs from high-scoring proteins, and “false” PSMs are defined to be decoy PSMs.  (This is “semi-supervised” statistical learning, because decoys accurately model false PSMs but high-scoring PSMs from high-scoring proteins are only an approximation to true PSMs.) PEP 1D is the probability that a PSM came from the “false” distribution rather than the “true”.

FDR 1D – Byonic ranks all PSMs, target and decoy, by increasing PEP 1D.  FDR 1D for the PSM with rank n is the number of decoy PSMs with rank at most n divided by the number of target PSMs with rank at most n.  (If FDR 1D ever goes above 1.0, we simply report 1.0.) Notice that FDR 1D is not monotonic with increasing PEP 1D; each decoy causes FDR 1D to jump up and then a string of targets causes FDR 1D to decline until the next decoy.
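The FDR 1D definition can be sketched in a few lines of Python.  The input layout, a list of (PEP 1D, is-decoy) pairs, and the function name are assumptions made for the example; they are not Byonic's internal data structures.

```python
def fdr_1d(psms):
    """psms: list of (pep_1d, is_decoy) pairs.

    Returns the FDR for each PSM, in order of increasing PEP 1D:
    decoys ranked at most n divided by targets ranked at most n,
    capped at 1.0.
    """
    ranked = sorted(psms, key=lambda p: p[0])
    fdrs = []
    decoys = targets = 0
    for _pep, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(1.0 if targets == 0 else min(1.0, decoys / targets))
    return fdrs


if __name__ == "__main__":
    example = [(0.001, False), (0.002, False), (0.01, True),
               (0.02, False), (0.03, False)]
    print(fdr_1d(example))
```

In the example the third PSM is a decoy, so FDR 1D jumps from 0 to 0.5 and then declines to 1/3 and 1/4 as further targets are counted, which is the non-monotonic behavior described above.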

FDR 2D – One deficiency in PEP 1D and FDR 1D is that they do not take into account the protein of origin for a PSM.  A low-scoring PSM to a top protein is more likely to be true than a low-scoring match to an arbitrary protein.  Assuming that decoy PSMs come from reversed proteins, a two-dimensional ranking procedure can compute a “protein-sensitive” FDR.  First cut the PSMs by protein of origin to an acceptable protein FDR; then add into the list of remaining PSMs all decoys coming from reverses of accepted proteins; rank the augmented PSM list by PEP 1D; and finally define FDR 2D for the PSM with rank n to be the number of decoy PSMs with rank at most n divided by the number of target PSMs with rank at most n.

PEP 2D – We can also compute a two-dimensional version of PEP by giving a bonus to PSMs from top proteins (proteins ranked above the top decoy protein).  The chance that a random PSM hits such a protein is roughly the fraction of the protein database occupied by top proteins, so we simply multiply PEP 1D by this fraction to obtain PEP 2D.  In order that the bonus not cut off abruptly at the top decoy protein, we fade out the bonus in a band around the top decoys.

FDR Unique 1D – Rank PSMs by PEP 1D.  Remove all PSMs that are identical to higher-ranked PSMs to form a list of unique PSMs.  FDR Unique 1D for the unique PSM in rank n (and for all its now-removed duplicates) is the number of decoy unique PSMs ranked at most n divided by the number of unique target PSMs ranked at most n.

FDR Unique 2D – Rank PSMs by PEP 2D and then follow the procedure for FDR Unique 1D.  FDR Unique 1D and 2D are better measures of false discovery rate than FDR 1D and 2D; duplicates should be discounted because 100 duplicates are only a single “discovery”.
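The de-duplication step behind the FDR Unique columns can be sketched as follows.  The dictionary fields ("key", "pep", "is_decoy") are hypothetical; "key" stands for whatever identifies a PSM as identical to another, for example the peptide sequence together with its modifications and their placements.  Passing PEP 1D values gives the 1D variant and passing PEP 2D values gives the 2D variant.

```python
def fdr_unique(psms):
    """psms: list of dicts with 'key', 'pep' and 'is_decoy'.

    Returns a dict mapping each unique key to its FDR.  Duplicates of
    a higher-ranked PSM are skipped and share that PSM's value.
    """
    ranked = sorted(psms, key=lambda p: p["pep"])
    fdr_by_key = {}
    decoys = targets = 0
    for psm in ranked:
        if psm["key"] in fdr_by_key:
            continue  # duplicate of a higher-ranked PSM
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        fdr = 1.0 if targets == 0 else min(1.0, decoys / targets)
        fdr_by_key[psm["key"]] = fdr
    return fdr_by_key


if __name__ == "__main__":
    psms = [
        {"key": "PEPTIDEA", "pep": 0.001, "is_decoy": False},
        {"key": "PEPTIDEA", "pep": 0.002, "is_decoy": False},
        {"key": "PEPTIDEB", "pep": 0.003, "is_decoy": False},
        {"key": "DECOYX", "pep": 0.010, "is_decoy": True},
        {"key": "PEPTIDEC", "pep": 0.020, "is_decoy": False},
    ]
    print(fdr_unique(psms))
```

Here the second copy of PEPTIDEA does not add to the target count, so the decoy is measured against two unique targets (FDR 0.5) rather than three.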

q-value 1D – Target/decoy-based FDR suffers from two small flaws: (1) a PSM can have lower PEP but higher FDR than another PSM, and (2) the FDRs of PSMs ranked above the top decoy are all zero, even though these PSMs may vary from perfect matches down to merely okay matches.  q-value 1D is a version of FDR 1D that corrects both flaws, but it may lose some accuracy at higher FDR values due to the monotonicity requirement.

q-value 2D – A version of FDR 2D that corrects the flaws mentioned above.
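The monotonicity requirement on q-values can be illustrated with a short sketch.  A common way to make FDR values monotonic, assumed here and not confirmed as Byonic's exact procedure, is to give each rank the smallest FDR found at that rank or at any lower-ranked position.  The sketch addresses only flaw (1); it does not attempt the correction for flaw (2), because this manual does not describe how non-zero values are assigned above the top decoy.

```python
def monotone_q_values(fdrs):
    """fdrs: FDR values in rank order (best PSM first).

    Returns values that never decrease with rank, by taking a running
    minimum from the bottom of the list upwards.
    """
    q = list(fdrs)
    running_min = 1.0
    for i in range(len(q) - 1, -1, -1):
        running_min = min(running_min, q[i])
        q[i] = running_min
    return q


if __name__ == "__main__":
    print(monotone_q_values([0.0, 0.0, 0.5, 0.4, 0.25]))
```

With the input above, the jump to 0.5 at the decoy is replaced by 0.25, the lowest FDR reached further down the list, so a PSM with lower PEP can no longer receive a higher value than a PSM ranked below it.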

# of unique peptides – Total number of PSMs for the protein that “owns” this PSM, discounting exact duplicates.  The same peptide with the same modification differently placed counts as a distinct PSM.

Comment – The comment= value is from the TITLE= line in the .mgf.  Depending upon what software wrote the .mgf, the TITLE= line sometimes contains the scan number from the raw file.

Scan # – This is the scan number for the MS/MS spectrum.

Scan time – This is the time of the scan, usually reported in seconds (depending on the input data file format).  This is usually also considered the retention time (R.T.) in LC-MS experiments.

 

5 More on Modifications

Here we provide more details about Byonic’s modification rules.

Known Masses.  As a convenience for custom (typed-in) modification rules, Byonic maps certain integer masses to commonly used exact masses.  For example, Byonic automatically maps +16 to +15.994915, the monoisotopic mass of oxygen.   Byonic maps the following (6 digits internally, but only 3 digits shown here):  1 @ 0.984, –1 @ –0.984, 14 @ 14.016, 16 @ 15.995, –17 @ –17.027,  –18 @ –18.011, 22 @ 21.982, 28 @ 28.031, 32 @ 31.990, 42 @ 42.011, 43 @ 43.006, 48 @ 47.985, 57 @ 57.021, 58 @ 58.005, 80 @ 79.966.   As further shorthand, the + before a positive mass delta is unnecessary, so that S[+80] and S[80] are both acceptable.
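The known-mass shorthand amounts to a table lookup.  The sketch below uses the 3-digit values listed above (Byonic itself uses 6 digits internally) and a hypothetical parser for the residue[mass] form; it is an illustration of the mapping, not Byonic's own parser.

```python
import re

# Integer shorthand -> mass, copied from the list above (3 digits only).
KNOWN_MASSES = {
    1: 0.984, -1: -0.984, 14: 14.016, 16: 15.995, -17: -17.027,
    -18: -18.011, 22: 21.982, 28: 28.031, 32: 31.990, 42: 42.011,
    43: 43.006, 48: 47.985, 57: 57.021, 58: 58.005, 80: 79.966,
}

_SHORTHAND = re.compile(r"([A-Z])\[([+-]?\d+(?:\.\d+)?)\]")


def parse_shorthand(rule):
    """Parse e.g. 'S[+80]' into ('S', 79.966).

    Integer deltas found in KNOWN_MASSES are replaced by the mapped
    mass; any other delta is returned as typed.
    """
    match = _SHORTHAND.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"not a residue[mass] rule: {rule!r}")
    residue, text = match.groups()
    if "." not in text and int(text) in KNOWN_MASSES:
        return residue, KNOWN_MASSES[int(text)]
    return residue, float(text)


if __name__ == "__main__":
    for rule in ("S[+80]", "S[80]", "N[+1]", "S[-18]", "P[15.995]"):
        print(rule, parse_shorthand(rule))
```

S[+80] and S[80] give the same result, and a delta that is not in the table, or that is typed with decimals, is passed through unchanged.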

Bundled Modifications.  A modification rule can specify more than one target residue at a time.  For example, the one rule

Phospho / +79.966331 @ S, T | common2

is identical to the two rules

Phospho / +79.966331 @ S | common2

Phospho / +79.966331 @ T | common2

Using the old syntax, one would write [ST][79.9663], common2.  Notice that the new syntax requires a comma separating S and T, but the old syntax requires no comma.
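The equivalence between a bundled rule and its single-target rules can be sketched with simple string handling.  The function name is hypothetical and the parsing is deliberately minimal: it assumes the new syntax of the form "Name / mass @ targets | level", with targets separated by commas.

```python
def expand_bundled(rule):
    """Split a bundled new-syntax rule into one rule per target."""
    head, rest = (part.strip() for part in rule.split("|", 1))
    modification, targets = (part.strip() for part in head.split("@", 1))
    return [f"{modification} @ {target.strip()} | {rest}"
            for target in targets.split(",")]


if __name__ == "__main__":
    for single in expand_bundled("Phospho / +79.966331 @ S, T | common2"):
        print(single)
```

Because the split is on commas only, a target written with a space rather than a comma is kept as one target.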

Protein Terminal Modifications.  The following rule specifies acetylation on the protein, but not peptide, N-terminus:

Acetyl / +42.010565 @ Protein NTerm | rare1

In the old syntax, we write  N-terminal [42.011], rare | ProteinTerminalMod

Protein-Specific Modifications.  The following rule specifies hydroxyproline only on proteins that include the string “collagen” in their FASTA file names.

Oxidation / +15.994915 @ P | common3 | ProteinLabel[icase]{collagen}

The last field is an example of a modification attribute; to add an attribute pull down the menu circled in red in Figure 3 above.  In the old syntax, we write

P[15.995], common3 | ProteinLabel[icase]{collagen}

The keyword [icase] specifies case-insensitive match, so that collagen will also match “Collagen”, “procollagen”, “collagenase”, etc.  The keyword [case] specifies case-sensitive match.  To match a single protein, use a unique identifier such as the accession number.  Protein-specific modifications offer even finer control of the size and focus of a Byonic search and can be extremely useful when the search space would otherwise be unnecessarily large, as mentioned below.
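The effect of [icase] and [case] can be pictured as a substring test on the protein's FASTA name.  The sketch below assumes plain substring matching, as suggested by the examples above; the function name and the sample FASTA names are hypothetical.

```python
def protein_label_matches(fasta_name, label, case_sensitive=False):
    """True if label occurs anywhere in the FASTA name.

    case_sensitive=False mimics [icase]; True mimics [case].
    """
    if case_sensitive:
        return label in fasta_name
    return label.lower() in fasta_name.lower()


if __name__ == "__main__":
    names = [
        "Collagen alpha-1(I) chain",
        "procollagen C-endopeptidase enhancer",
        "Serum albumin",
    ]
    for name in names:
        print(name, protein_label_matches(name, "collagen"))
```

With [icase] the first two names match and the third does not; with [case] only the second matches, because "Collagen" in the first name is capitalized.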

Glycopeptide searches.  The fully automatic, pre-set, glycopeptide searches (the checkboxes under Glycan preset tables) allow only one glycan per peptide.  The limitation of one glycan per peptide is not a severe restriction for N-linked glycosylation, because few peptides contain two N-glycosylation motifs, that is, two occurrences of NX{S/T}.  This limitation is, however, a serious restriction for O-linked glycosylation, especially mucin glycosylation, and for O-GlcNAc-ylation on S/T, because sites for these modifications often cluster together.  The following rule gives a reasonable O-GlcNAc search.

HexNAc / +203.079373 @ S, T | common3

Alternative forms for the same rule are HexNAc @ S, T | common3 (letting the software compute the mass) and HexNAc @ OGlycan | common3.  For a faster but narrower search, use HexNAc @ S, T | common2, or for an intermediate search, designate the modification as both common2 and rare1.  Here is a slightly expanded search rule:

HexNAc / +203.079373 @ NGlycan, S, T | common3

NGlycan as a modification target means the asparagine in the N-glycosylation motif NX{S/T}, where X is any residue except proline.  The expanded rule searches for truncated N-glycans; with the earlier rule a peptide with a truncated N-glycan is likely to be mistaken for an O-GlcNAc peptide.  With the old syntax, one would use two rules, [ST][+203.079], common3 and NGlycan[+203.079], common3, for the expanded rule above.
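The NX{S/T} motif is easy to express as a regular expression.  The following sketch, with a hypothetical function name, lists the 1-based positions of asparagines that would qualify as NGlycan targets in a peptide sequence; it illustrates the motif only and says nothing about how Byonic locates sites internally.

```python
import re

# N, then any residue except P, then S or T.  The lookahead lets
# overlapping motifs (e.g. "NNTS") both be reported.
_NGLYCAN_MOTIF = re.compile(r"N(?=[^P][ST])")


def nglycan_sites(peptide):
    """Return 1-based positions of N in the motif NX{S/T}, X != P."""
    return [m.start() + 1 for m in _NGLYCAN_MOTIF.finditer(peptide)]


if __name__ == "__main__":
    print(nglycan_sites("ANVSNPTNGT"))
```

In ANVSNPTNGT the asparagines at positions 2 and 8 qualify, while the one at position 5 does not because it is followed by proline.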

The NGlycan keyword enables the user to search for more than one N-glycan per peptide.   The keyword also enables the user to customize the N-glycosylation search to specific glycan masses, rather than rely on Byonic’s predefined tables.  For example, a researcher may have already acquired detailed knowledge of the glycan masses through detached glycan analysis.  For human blood serum samples, the most common N-glycan is usually HexNAc(4)Hex(5)NeuAc(2) at 2204.77 Da.
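The quoted mass of HexNAc(4)Hex(5)NeuAc(2) can be checked by summing residue masses.  In the sketch below the HexNAc and Hex masses are the ones used in rules elsewhere in this manual; the NeuAc residue mass (291.095417 Da) is not given in this manual and is supplied here as an assumption.  The function name is hypothetical.

```python
import re

# Monoisotopic residue masses in Da.  NeuAc is an assumed value,
# not taken from this manual.
RESIDUE_MASS = {
    "HexNAc": 203.079373,
    "Hex": 162.052824,
    "NeuAc": 291.095417,
}


def glycan_mass(composition):
    """Sum residue masses for a composition such as 'HexNAc(4)Hex(5)'."""
    total = 0.0
    for name, count in re.findall(r"([A-Za-z]+)\((\d+)\)", composition):
        total += RESIDUE_MASS[name] * int(count)
    return total


if __name__ == "__main__":
    print(round(glycan_mass("HexNAc(4)Hex(5)NeuAc(2)"), 2))
```

Under these masses the composition sums to about 2204.77 Da, in agreement with the value quoted above.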

Mucin-type O-glycosylation sites tend to cluster, so that a single tryptic peptide may have 5 or more glycosylation sites.  Searches allowing 4 or more common modifications are computationally intensive, so these searches are best run with small protein databases and limited lists of glycan masses.  With Total common modification max set to 5, a search involving 10 proteins (typically the protein of interest, along with trypsin and contaminants) and 10 likely O-glycans (for example, HexNAc(1), HexNAc(1)Hex(1), HexNAc(1)Hex(1)NeuAc(1), HexNAc(1)Hex(1)NeuAc(2), HexNAc(2)Hex(1), etc.) may already be an undesirably large search.  Making the glycosylation protein-specific using the ProteinLabel attribute is one way to help control the combinatorial explosion.

Byonic allows the user to combine handcrafted modification rules with one or both of the predefined glycopeptide searches.  One strategy is to use the predefined searches for exploration, and then iteratively move the modification masses found in confident identifications into handcrafted rules in order to focus the range of modification masses while simultaneously increasing the number of allowed instances of each modification.

Wildcard mass defect.  If the precursor mass tolerance is tight (less than 100 ppm), Byonic obtains the exact mass of the wildcard from the difference between the observed precursor mass and the calculated mass of the candidate peptide.  If the precursor mass tolerance is loose (≥ 100 ppm), Byonic uses a mass defect (fractional part of the mass) characteristic of an organic molecule, specifically 0.05 per 100 Da.  In either case, Byonic will show the wildcard mass it used in the output.  With precursor masses good to 5 ppm or less, a modest-size wildcard, say in the range –40 Da to 40 Da, may pinpoint the elemental composition.

 

6 False Discovery Rate

Internally Byonic retains all PSMs and all proteins.  For presentation to the user, however, Byonic by default cuts the protein list after the 20th decoy protein or at the point in the list at which the protein FDR first reaches 1%, whichever cut gives more proteins.  On most data sets, almost all the true proteins will be on this truncated list.  Researchers with special knowledge of their samples, along with the time to examine protein identifications manually, may choose to make a lower protein cut using the options on the Advanced tab of the Byonic input window.
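The default protein cut can be sketched as the larger of two cut points.  The sketch makes several assumptions that this manual does not spell out: protein FDR at rank n is taken to be decoys divided by targets among the top n proteins, each cut keeps the protein at which its condition is first met, and a condition that is never met keeps the whole list.  The function and parameter names are hypothetical.

```python
def protein_cut(is_decoy, max_decoys=20, fdr_limit=0.01):
    """is_decoy: booleans for the ranked protein list, best first.

    Returns how many top-ranked proteins to keep: the larger of
    (a) the rank of the max_decoys-th decoy and
    (b) the rank at which protein FDR first reaches fdr_limit.
    """
    n = len(is_decoy)
    cut_decoy = cut_fdr = n      # keep everything if never triggered
    found_decoy_cut = found_fdr_cut = False
    decoys = targets = 0
    for rank, decoy in enumerate(is_decoy, start=1):
        if decoy:
            decoys += 1
        else:
            targets += 1
        if not found_decoy_cut and decoys == max_decoys:
            cut_decoy, found_decoy_cut = rank, True
        # decoys / targets >= fdr_limit, written without division
        if not found_fdr_cut and decoys > 0 and decoys >= fdr_limit * targets:
            cut_fdr, found_fdr_cut = rank, True
    return max(cut_decoy, cut_fdr)


if __name__ == "__main__":
    ranked = [False] * 150 + [True] + [False] * 10 + [True] + [False] * 5
    print(protein_cut(ranked, max_decoys=1))
```

In the example, run with max_decoys=1 to keep it short, the first decoy at rank 151 gives a protein FDR below 1%, so the FDR-based cut is reached only at the second decoy (rank 162), and the larger of the two cuts is used.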

After making the protein cut, Byonic makes a PSM cut in order to keep the spectrum-level FDR to a reasonable level, that is, in order to discard false matches to the reported proteins.  (See “Two-dimensional target decoy strategy for shotgun proteomics,” by M. Bern and Y. Kil in J. Proteome Res., 2011, vol 10(12), 5296-5301. PMID 22010998.) By default Byonic makes the PSM cut by discarding the n PSMs with the lowest Byonic scores, where n is the expected number of random PSMs matching the reported proteins.  Byonic then estimates the spectrum-level FDR of the remaining PSMs to the reported proteins; this FDR will typically be in the range 0 – 5%.  Alternatively, using the Advanced tab, the user can ask Byonic to make a manual score cut in order to obtain a desired spectrum-level FDR.  Warning:  A significant number of true PSMs may be lost by imposing an arbitrary low FDR limit rather than accepting Byonic’s automatic cutoff.

And a final word of caution:  The target-decoy strategy gives an accurate estimate of the rate of completely false identifications at both the protein and PSM levels, but it does not give any estimate of “partially correct” identifications, for example, wrong homologues or splice variants in the case of proteins, or misplaced or incorrect modifications in the case of peptides.  Peptide identifications that may be only partially correct can usually be recognized by low DeltaMod (often zero), and manual inspection of the annotated spectrum may in some cases resolve the ambiguity.  Even with all of Byonic’s advances over the previous state of the art, human judgment remains the ultimate arbiter of subtle identification clues.

 

7 Appendix – Common Modifications

 

Below are examples of often encountered modifications and the appropriate syntax for including those modifications in a Byonic search.

 

Table 1.  Cysteine Treatments + Artifacts of Treatment

The modifications in blue apply to samples with the cysteine treatment in black above them.  The rule syntax used by the pull-down menu is also accepted in the custom modification entry box, as is the “old syntax” that pre-dates the pull-down menu interface.

Rule from Pull-Down Menu                                Explanation
Carbamidomethyl / +57.021464 @ C | fixed                Iodoacetamide treatment
(De)Carbamidomethyl / -57.021464 @ C | common1          Under-alkylation
Ammonia-loss / -17.026549 @ NTerm C | rare1             Pyro-glu from camC
Methyl / +14.01565 @ C | rare1                          Propionamide (from gel)
DTT / +151.996571 @ C | rare1                           Total mass delta 209 Da
Carbamidomethyl / +57.021464 @ NTerm, H, K | common2    Over-alkylation
Carboxymethyl / +58.005479 @ C | fixed                  Iodoacetic acid treatment
DTT / +151.996571 @ C | rare1                           Total mass delta 210 Da
Methylthio / +45.987721 @ C | fixed                     MMTS treatment
Propionamide / +71.037114 @ C | common2                 Common in gel samples

 

Table 2.  Other Chemical Treatments

Mixtures of isotopically labeled and unlabeled peptides (e.g., SILAC) are best run as two searches with fixed modifications, rather than as variable modifications, because peptides should be either completely labeled or completely unlabeled.

Rule from Pull-Down Menu                       Explanation
Propionyl / +56.026215 @ NTerm, K | fixed      Propionylation
TMT / +224.152478 @ NTerm, K | fixed           TMT0 labeling
TMT2plex / +225.155833 @ NTerm, K | fixed      TMT2 labeling
iTRAQ4plex / +144.102063 @ NTerm, K | fixed    iTRAQ labeling
Label:13C(6)15N(2) / +8.014199 @ K | fixed     Heavy Lysine for SILAC
Label:13C(8)15N(2) / +10.020909 @ R | fixed    Heavy Arginine for SILAC

 

Table 3.  Common In Vitro Modifications

Rule from Pull-Down Menu
Deamidated / +0.984016 @ N, Q | common2
Carbamyl / +43.005814 @ NTerm | common1
Gln->pyro-Glu / -17.026549 @ NTerm Q | common1
Glu->pyro-Glu / -18.010565 @ NTerm E | common1
Delta:H(2)C(2) / +26.01565 @ NTerm | common1
Oxidation / +15.994915 @ M | common2
Dioxidation / +31.989829 @ M, W | rare1
Trioxidation / +47.984744 @ C | rare1
Dethiomethyl / -48.003371 @ M | common2
Cation:Na / +21.981943 @ D, E | common1
Methyl / +14.01565 @ E | common1
Dehydrated / -18.010565 @ S, T | rare1

 

Table 4.  Common In Vivo (Biological) Modifications

Rule from Pull-Down Menu
Oxidation / +15.994915 @ P | common3 | ProteinLabel[icase]{collagen}
Phospho / +79.966331 @ S, T | common3
Phospho / +79.966331 @ Y | common2
Methyl / +14.01565 @ K, R | common2
Dimethyl / +28.0313 @ K | common2
Acetyl / +42.010565 @ K | common2
Trimethyl / +42.04695 @ K | common2
Acetyl / +42.010565 @ Protein NTerm | common1
Amidated / -0.984016 @ Protein CTerm | rare1
Sulfo / +79.956815 @ C, S, T, Y | common2
HexNAc / +203.079373 @ S, T | common2
HexNAc(1)Hex(1) @ OGlycan | common2
HexNAc(4)Hex(5)NeuAc(2) @ NGlycan | rare1
HexNAc(4)Hex(5)NeuAc(2)Sodium(1) @ NGlycan | rare1
Hex / +162.052824 @ K | common2