Protein Metrics

Minimum requirement:          Recommended PC:

Windows 7 64-bit              Windows 10 64-bit

4 GB RAM                      16 GB RAM

500 GB disk space             1 TB disk space (solid-state drive)

Dual-core CPU                 Many-core CPU (e.g., Intel Core i7-5930K @ 3.50 GHz)

Java 7 or higher              Java 7 or higher

 

Recommended PC for high-performance computing (e.g., 32+ cores)

Windows Server 2012 or 2016

64 GB RAM

2 TB disk space

Two Xeon CPUs (recommend 8 to 16 physical cores)

Java 7 64-bit or higher

Please provide the following when reporting an issue with Byos, Byologic, Byomap, IntactMass, or Supernovo:

 

1. Support information.

Go to the Help menu -> Support package…

When prompted, compress the package, then press OK to navigate to the file.  Include this file in your email.

 

2.  Workflow or preset file

In Byos, click Save as Portable Workflow….

If not using Byos, click the Save Preset… button in the New Project window.

 

3. The observed error message.

A screenshot of the message dialog would be best.

 

4. Any relevant steps to reproduce the issue.

Was the data dragged & dropped? Which buttons were clicked? At which step was the issue observed? etc.

 

5. [If permissible] The raw data file(s) and the fasta file

These may be required so that we can fully reproduce the issue.  We can provide an upload location for large files. Note: we keep the data confidential, and we can work with your team to sign an NDA if needed.

 

6.  [Conditional] Byonic logs

If there was an issue during the Byonic search stage of a Byos workflow, we would also need the Byonic logs, since these are not currently saved in the Byos log files. Byonic results and log files are saved in:

Project_file_folder / Byos_tmp / Project.xxxxx / Byonic(n) / objs /

Locate the 3 files named: “Log.sys_err.txt”, “Log.sys_out.txt”, and “log_operations.txt”.

 

7.  Please send all of the above to support@proteinmetrics.com

To report an issue with Byonic, please provide the following:

 

1. The .byparms file and the log files

These can be found in the output folder.  E.g.,
C:\data_results\sample.raw_20190101_Byonic\sample.raw_20190228_Byonic.byparms

The log files are in the objs/ subfolder: 3 files named “Log.sys_err.txt”, “Log.sys_out.txt”, and “log_operations.txt”.  E.g.,

C:\data_results\Refimab_C18_trypsin_high_10ug.raw_20190228_Byonic\objs\Log.sys_err.txt
C:\data_results\Refimab_C18_trypsin_high_10ug.raw_20190228_Byonic\objs\Log.sys_out.txt
C:\data_results\Refimab_C18_trypsin_high_10ug.raw_20190228_Byonic\objs\log_operations.txt

 

2. [If permissible]  The raw data file(s) and the fasta file

These may be required depending on the problem so that we can fully reproduce the issue.  We can provide an upload location for large files.  Note that we keep these confidential, and we can work with your team to sign an NDA if needed.

 

3.  Please send all of the above to support@proteinmetrics.com

Byonic User Manual v2.6 (Sept 2015)


 

1 Overview

Byonic is a software package for identifying peptides and proteins by tandem mass spectrometry.  Byonic plays the same role as Mascot, SEQUEST, and X!Tandem, but offers greater accuracy, sensitivity, and flexibility.  Byonic provides three major features not found in the other search engines:  Modification Fine Control™, Wildcard Search™, and glycopeptide search.

Modification Fine Control™ enables the user to search for 10s or even 100s of modification types at a time without a combinatorial explosion.  For example, a user might allow up to three phosphoserines, S[+80], per peptide, but allow at most one beta elimination, S[-18], and at most one deamidated asparagine, N[+1].  To further reduce the search, the user might allow at most one of either S[-18] or N[+1], that is, disallowing peptides containing one of each.  Modification fine control empowers the user to tailor the search to the sample, and thus avoid overly narrow searches that miss interesting peptides and overly broad searches that run for hours or days and produce “noisy” results with many false positives.

Wildcard Search™ enables the user to search for unanticipated, or even unknown, modifications.  A wildcard can modify any residue by any mass delta within a user-settable range.  Wildcard masses occur at roughly 1.0 Dalton spacing just like molecular masses.  There is a limit of one wildcard per peptide.

Glycopeptide search enables the user to search for glycosylated peptides, without prior knowledge of either glycosylation sites or glycan masses.  Byonic offers three ways to specify glycopeptide searches:  internal tables, external tables, and modification fine control.  Byonic’s internal tables contain the most likely N- and O-linked glycan compositions, but allow only one glycan per peptide.  The other two options allow the user to customize the list of glycans and/or allow more than one glycan per peptide.

Top-down, middle-down, and bottom-up proteomics – Byonic is uniquely capable of top-down, middle-down, and bottom-up searches, but it requires isotope-resolved precursors in order to determine precursor ion charges.

Disulfide bonds, trisulfide bonds, and general crosslinking – Byonic has disulfide bond, trisulfide bond, and general crosslink search capability.  It is designed to search for all disulfide pairs (both expected and shuffled disulfide bonds) and the user can constrain which protein chains to consider when doing the search.

New features in Byonic Version 2.6 include support for Thermo, Bruker, Sciex, Waters and Agilent native file formats, disulfide bond and general crosslink search capability, tables of predicted and observed ions for peptide-spectrum matches, and greater control of the naming of output folders for search results.


3 Input

Figure 1 shows an example Byonic search set-up in the graphical user interface (GUI).  In the top pane of Byonic’s control window, there are boxes for setting the two main inputs:  a spectrum data file in a standard format (MGF, mzML, mzXML, or Thermo RAW), and a protein database in FASTA format.  (Note Byonic can search data from all major vendors’ mass spectrometers.)  The protein database should contain both targets and decoys (recognized by protein names beginning >Reverse or >Decoy) for false discovery rate (FDR) estimation.  Byonic will automatically add decoys and contaminant proteins (e.g., trypsin, bovine serum albumin, and human keratins) if the appropriate boxes are checked.

An organized way to configure Byonic is to create input folders C:\data_input\Mass_Spectra and C:\data_input\Protein_Databases, and an output folder C:\data_results.  For each search, Byonic will create a reformatted spectrum file inside the Mass_Spectra directory with extension .byspec2, and a new folder inside the results folder (conventionally C:\data_results) with a long name including the time of the Byonic search and the name of the spectrum data file.  This new folder will contain the output in two different forms:  a Microsoft Excel spreadsheet with extension .xlsx and a Byonic results file with extension .byrslt, to be read by Byonic’s output Viewer.   The new folder will also contain a subfolder named objs, which will contain log files and the parameters file (extension .byparms) used to run Byonic.  The .byparms file contains all the provenance to reproduce the search.

Note that a new feature of Byonic v2.6 gives users additional control over the output folder and the name of each individual search results folder.  In Figure 1, just below the Input files section, there is a new section called Output folder.  The button on the left allows a user to see and designate the output parent folder for search results.  (Due to a Windows DOS limitation, the total folder path needs to be less than 256 characters.)  To the right is a field that allows a user to designate a result folder name.  Users can name the folder anything, but Byonic recognizes some default templates and substitutes for them in the result folder name to add detail; Byonic will fill in:

  • [spec] with the name of the mass spectrum data file,
  • [fasta] with the name of the fasta file,
  • [date] with the current date (year,month,day), and
  • [time] with the current time (hour, minute)
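As an illustration of how these placeholders expand, the following Python sketch mimics the substitution.  It is not Byonic's code:  the function name expand_result_folder_name and the exact date and time formats (YYYYMMDD and HHMM) are assumptions made for the example.

```python
from datetime import datetime

def expand_result_folder_name(template, spec_name, fasta_name, now):
    """Replace the [spec], [fasta], [date], and [time] placeholders.

    Hypothetical helper: the date (YYYYMMDD) and time (HHMM) formats
    are assumptions, not Byonic's documented output.
    """
    replacements = {
        "[spec]": spec_name,
        "[fasta]": fasta_name,
        "[date]": now.strftime("%Y%m%d"),
        "[time]": now.strftime("%H%M"),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template

example = expand_result_folder_name(
    "[spec]_[date]_Byonic", "sample.raw", "human.fasta",
    datetime(2019, 2, 28, 14, 5))
print(example)  # sample.raw_20190228_Byonic
```

Text outside the square-bracket placeholders is copied through unchanged, so a fixed suffix such as _Byonic can be combined with any of the templates.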

In the bottom pane of the input GUI, there is a button to run the program, a pull-down menu that controls the number of computer cores of the CPU that Byonic will use, and checkboxes that determine what happens upon completion.  The middle pane is the one that specifies the search itself.  This pane has six tabs:  Digestion and Instrument Parameters; Modifications; Glycans; S-S, Xlink; Advanced; and Progress.

Byonic_Fig1

Figure 1.  Byonic input GUI with the Digestion and Instrument Parameters tab open.


3.1 Digestion and Instrument Parameters

As Figure 1 shows, the Digestion and Instrument Parameters tab allows the user to set the residues recognized by the digestion enzyme.  In this example, the enzyme is trypsin, so the user entered RK for arginine and lysine and chose C-terminal for the cleavage side.  If the user leaves the Cleavage sites box empty, the only specific cleavage sites are protein termini.   Here the user chose Fully specific search, meaning that both the N- and C-terminal cleavages must be C-terminal to R or K.  Byonic supports nonspecific cleavage at either or both endpoints.  (A nonspecific search with RK in the Cleavage sites box searches all peptides but favors tryptic peptides; the user must leave the Cleavage sites box empty for a true no-enzyme search.)  The user selected 2 Missed cleavages, which limits the number of internal Rs and Ks not followed by P to at most 2; leaving Missed cleavages at its default value of -1 allows any number of internal Rs and Ks.
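To make the effect of the Missed cleavages setting concrete, here is a minimal Python sketch of a fully specific, C-terminal in silico digestion.  It is an illustration only and not Byonic's implementation; the function name digest and the example sequence are hypothetical, and the rule that a site followed by P is not cleaved is inferred from the wording above.

```python
def digest(protein, cleavage_sites="RK", max_missed=2):
    """Fully specific C-terminal digestion (illustrative sketch only).

    Assumption taken from the text: a site followed by P is not cleaved.
    max_missed=-1 allows any number of missed cleavages.
    """
    # Positions in the sequence where the chain may be cut.
    cuts = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in cleavage_sites and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))

    peptides = []
    for start in range(len(cuts) - 1):
        for end in range(start + 1, len(cuts)):
            missed = end - start - 1  # internal cut points left uncut
            if max_missed != -1 and missed > max_missed:
                break
            peptides.append(protein[cuts[start]:cuts[end]])
    return peptides

print(digest("AGKTLRPEVKDS", max_missed=0))
# ['AGK', 'TLRPEVK', 'DS']
print(digest("AGKTLRPEVKDS", max_missed=1))
# ['AGK', 'AGKTLRPEVK', 'TLRPEVK', 'TLRPEVKDS', 'DS']
```

In the example, the R in TLRPEVK is followed by P, so it neither produces a cut nor counts as a missed cleavage.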

The user chose 10 ppm precursor mass tolerance, 40 ppm fragment mass tolerance, and QTOF/HCD fragmentation.  Byonic supports both Dalton and ppm mass tolerances for both precursors and fragments, and supports CID, TOF-TOF, QTOF HCD, ETD/ECD, and EThcD fragmentation types.  The Dalton tolerance applies to measured mass for precursors but measured m/z for fragments.
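The difference between ppm and Dalton tolerances can be shown with a short Python sketch.  The function name within_tolerance is hypothetical, and the check is a generic tolerance comparison rather than Byonic's own code.

```python
def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """Return True if observed lies within the tolerance of theoretical."""
    if unit == "ppm":
        # A ppm tolerance scales with the mass (or m/z) being compared.
        allowed = theoretical * tolerance * 1e-6
    elif unit == "Da":
        # A Dalton tolerance is the same absolute window at every mass.
        allowed = tolerance
    else:
        raise ValueError("unit must be 'ppm' or 'Da'")
    return abs(observed - theoretical) <= allowed

# 10 ppm at 2350.120 Da is a window of about 0.0235 Da.
print(within_tolerance(2350.140, 2350.120, 10))   # True  (about 8.5 ppm)
print(within_tolerance(2350.150, 2350.120, 10))   # False (about 12.8 ppm)
```

The same function can be applied to precursor masses or to fragment m/z values; only the quantity passed in changes.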

 

Byonic_Fig2

Figure 2.  Modifications are shown in a text box.  Press Enter/Edit to pop up the menu in Figure 3 and change the modification settings.


3.2 Modifications

The Modifications tab, shown in Figure 2, is where the user finds Modification Fine Control and Wildcard Search.  Like most proteomics search engines, Byonic supports two types of modifications:  fixed and variable.  A fixed modification is assumed to occur on all the residues of that type, but a variable modification is optional, so that each site for a variable modification is considered with and without the modification.

Byonic also offers a unique feature not found in other search engines:  the user designates each variable modification as either “common” or “rare”, with the names suggesting their use.  Byonic has separate limits on the number of occurrences of each variable modification, so that “common2” means at most two occurrences per peptide.  Byonic also has separate limits on the total number of common and rare modifications per peptide.  A typical search allows a total of at most two common modifications and a total of at most one rare modification per peptide.  To search for, say, three phosphoserines per peptide, the user can change Total common modification max to 3 or split phosphorylated serine between two rules:  common2  and  rare1.  Depending upon the other modification rules, the latter approach may give a faster search.

In Figure 2, the user specified Carbamidomethyl / +57.021464 @ C | fixed, meaning carbamidomethylated cysteine (camC).  The user also specified Oxidation / +15.994915 @ M | common2, directing the program to consider each methionine residue with and without this modification, up to a limit of 2 such modifications per peptide.  In addition, the user specified Ammonia-loss / -17.026549 @ NTerm C | rare1, indicating that the program also considers ammonia loss from any N-terminal camC as a rare variable modification.  Variable modifications are added on top of fixed modifications.  One way to represent incomplete carbamidomethylation is with these two rules:  Carbamidomethyl / +57.021464 @ C | fixed and (De)Carbamidomethyl / -57.021464 @ C | common2.

The rule Carbamidomethyl / +57.021464 @ NTerm | rare1 specifies a common artifact (over-alkylation) on the peptide N-terminus.

The next two rules, +0.984016 @ N | common2  and  +0.984016 @ Q | common1, represent deamidation; here the user is allowing up to two deamidated asparagines (the more common deamidation) but only one deamidated glutamine per peptide.  The rule

Gln->pyro-Glu / -17.026549 @ NTerm Q | rare1

specifies a modification that occurs only on peptides with N-terminal glutamine.

Conceptually, Byonic has one modification “slot” for each residue, along with slots for the peptide’s N- and C-termini.  A variable modification such as +0.984016 @ N uses up the residue slot; a nonspecific terminal modification such as +57.021464 @ NTerm uses up the terminal slot; but residue-specific N-terminal modifications, such as -17.026549 @ NTerm Q, use up both the residue and the N-terminal slots.

A variable modification on top of a fixed modification is specified as extra mass, so these two rules

Carbamidomethyl / +57.0215 @ C | fixed
Propionamide / +14.016 @ C | common1

allow Cys to be either C[+57.0215] or C[+71.0375] (a gel artifact). Separate variable mods, however, do not sum, so these two rules

Carbamidomethyl / +57.0215 @ C | common1
Propionamide / +71.0375 @ C | common1

allow Cys to be C[+0] (unmodified), C[+57.0215], or C[+71.0375].
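The stacking behavior in these two examples can be checked with a few lines of Python.  This is a sketch of the arithmetic only; the function name allowed_deltas is hypothetical.

```python
def allowed_deltas(fixed, variables):
    """Mass deltas one residue can carry under the stacking rule above.

    fixed: fixed mass delta on the residue (0.0 if none).
    variables: mass deltas of the variable rules for that residue.
    At most one variable modification occupies the residue slot, and it
    is added on top of the fixed modification.
    """
    options = [fixed] + [fixed + v for v in variables]
    return [round(m, 4) for m in options]

print(allowed_deltas(57.0215, [14.016]))         # [57.0215, 71.0375]
print(allowed_deltas(0.0, [57.0215, 71.0375]))   # [0.0, 57.0215, 71.0375]
```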

The user enters modifications by pressing Enter/edit, circled in red in Figure 2, and obtaining a pop-up window.  The user can then specify any number of modification rules via a pull-down menu containing all the modifications listed in www.unimod.org, as shown in Figure 3.  For convenience, frequently used modifications are listed twice, at the top and again in the complete list.  The three pull-down menus in each row select modification type, target residues, and fine control.  There is a fourth pull-down, circled in red in Figure 3, which lets the user delete, invert (as in (De)Carbamidomethyl), or add “attributes” to modifications.  Attributes allow the user to define protein-specific modifications.

 

Byonic_Fig3

Figure 3.  The modifications pop-up contains a pull-down menu containing all of Unimod.

The big open box in Figure 3 is a space for the user to type in custom modifications not listed in Unimod.  Byonic’s fine control format has the form:

Modification_Name / Mass_Delta @ Targets | Fine_Control

Modification_Name / is optional.  The Targets field allows the 20 one-letter amino acid abbreviations, as well as four special locations:  NTerm, CTerm, Protein NTerm, and Protein CTerm.  NTerm, CTerm, Protein NTerm, and Protein CTerm can also be used as modifiers of amino acid residues.  Targets form a comma-separated list.  Here’s an example of a real modification not (yet) in Unimod:

DehydroFormyl / +9.98435 @ NTerm S, NTerm T | rare1
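For readers who generate or check rule lists with scripts, the following Python sketch parses a rule in this format.  It is not part of Byonic:  the regular expression, the function name parse_rule, and the assumption that the fine-control field is fixed, commonN, or rareN are all made up for the example from the rules shown in this section.

```python
import re

RULE = re.compile(
    r"^\s*(?:(?P<name>[^/@|]+?)\s*/\s*)?"    # optional "Modification_Name /"
    r"(?P<delta>[+-]?\d+(?:\.\d+)?)\s*@\s*"  # signed mass delta
    r"(?P<targets>[^|]+?)\s*\|\s*"           # comma-separated targets
    r"(?P<control>fixed|common\d+|rare\d+)\s*$"  # assumed keyword set
)

def parse_rule(text):
    """Split one fine-control rule into its four fields."""
    match = RULE.match(text)
    if match is None:
        raise ValueError(f"not a fine-control rule: {text!r}")
    return {
        "name": match.group("name"),
        "delta": float(match.group("delta")),
        "targets": [t.strip() for t in match.group("targets").split(",")],
        "control": match.group("control"),
    }

rule = parse_rule("DehydroFormyl / +9.98435 @ NTerm S, NTerm T | rare1")
print(rule["name"], rule["delta"], rule["targets"], rule["control"])
# DehydroFormyl 9.98435 ['NTerm S', 'NTerm T'] rare1
```

A rule written without a name, such as +0.984016 @ N | common2, parses with name set to None.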

Note – for comprehensive sequence variant searches, it is more convenient to paste in a list of sequence variant modifications here than to add the 380 potential sequence variant substitutions via the drop-down menus.  Such lists are available from Protein Metrics by contacting support@proteinmetrics.com.

For backwards compatibility with initial versions of the program, Byonic still accepts the syntax from earlier versions:

[Residues][Mass_Delta], Fine_Control

Examples of the earlier syntax are  [ST][+79.966], common2  for phosphorylation and N-terminal S[+9.984], rare1 for DehydroFormyl.

The rightmost box in the middle pane of the Modifications tab (see Figure 2) controls Wildcard Search™.  This box lets the user turn on wildcard search, set the range for the wildcard mass, and restrict the wildcard to certain residues if desired.  In the Restrict to residues box, the 20 single-letter residue abbreviations have their usual meanings, and (lower case) n denotes peptide N-terminus and (lower case) c denotes peptide C-terminus.   A wildcard, even one with a mass range of only 50 or 60 Da, greatly increases the size of the search, so it is best used with a focused database (see the section on the Advanced tab below) and used either alone or with only a few other modifications enabled.  Most wildcard mass shifts will be recognizable by an expert; hence, a wildcard can be used to discover which known modifications should be enabled in a subsequent search.  In the pictured example, the user did not use a wildcard.   For more details about the Wildcard Search, see our Application Note.

By specifying most modifications as rare, it is quite feasible to search for 10 – 20 modification types at once with Byonic.  Even larger searches are possible with focused protein databases, for example with therapeutic proteins.  Such a focused database easily allows efficient mutation searches with 200+ possible substitutions or oxidative footprinting searches with 50+ types of oxidations.  Glycans and wildcards can easily enlarge the search space by 2 – 3 orders of magnitude, so these options should be used with care, and in conjunction with only the most common variable modifications such as oxidized methionine or pyro-Glu N-terminus.  NOTE:  The single most important factor in search time is Total common max.  Roughly speaking, the search time grows as C^T, where C is the number of common modifications enabled and T is Total common max.
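A quick calculation shows why Total common max matters so much.  The sketch below reads the rule of thumb as search time growing roughly as C raised to the power T; the function name relative_search_cost is hypothetical, and the result is a unitless growth factor, not a time estimate.

```python
def relative_search_cost(num_common, total_common_max):
    """Rough growth factor C**T from the rule of thumb above."""
    return num_common ** total_common_max

# With 5 common modifications enabled, raising Total common max
# from 2 to 3 multiplies the rough cost by 5.
for total_max in (1, 2, 3):
    print(total_max, relative_search_cost(5, total_max))
# 1 5
# 2 25
# 3 125
```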

The Appendix of this Manual provides examples of frequently found modifications and appropriate syntax for including those modifications in a Byonic search.


3.3 Glycans

Byonic_Fig4

Figure 4.  Byonic offers three ways to define glycan modifications:  internal preset tables, external glycan databases, and user-defined glycans.

The next tab after the Modifications tab is the Glycans tab.  Clicking on the Enter/Edit button in this tab pops up a window labeled Select Glycans as seen in Figure 4.  The top part of the window (circled in red) allows the user to input a set of glycans all at once.  The user chooses Glycan type (N- or O-linked), browses for a text file of glycan compositions, and then sets Fine Control (rare1, common2, etc.).  The text file gives one glycan composition per line; for example, the following gives five of the most common human O-glycans.   Spaces between monosaccharides are optional, and unused monosaccharides can be left out or included with zero (0) occurrences.

HexNAc(1) Hex(0)

HexNAc(1) Hex(1) Fuc(0) NeuAc(0)

HexNAc(1)Hex(1)Fuc(0)NeuAc(1)NeuGc(0)

HexNAc(1)Hex(1)Fuc(0)NeuAc(2)NeuGc(0)

HexNAc(1)Hex(1)Fuc(1)NeuAc(0)NeuGc(0)
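Before loading a glycan text file, it can be useful to check that every line parses as expected.  The Python sketch below reads one composition line in the format shown above; the function name parse_composition is hypothetical, and the parser is an illustration rather than Byonic's own reader.

```python
import re

def parse_composition(line):
    """Parse e.g. 'HexNAc(1) Hex(1) NeuAc(2)' into a dict of counts.

    Monosaccharides with a zero count are dropped, since the text says
    they may be left out or listed with zero occurrences.
    """
    counts = {}
    for name, number in re.findall(r"([A-Za-z]+)\((\d+)\)", line):
        if int(number) > 0:
            counts[name] = int(number)
    return counts

with_zeros = parse_composition("HexNAc(1) Hex(1) Fuc(0) NeuAc(0)")
without_zeros = parse_composition("HexNAc(1)Hex(1)")
print(with_zeros, with_zeros == without_zeros)
# {'HexNAc': 1, 'Hex': 1} True
```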

The bottom part of Select Glycans allows the user to input glycans one at a time by specifying monosaccharide compositions.  Byonic allows six monosaccharide residues:  HexNAc, Hexose, Fucose, Pentose (common in plants), NeuAc, and NeuGc (common in non-humans).  Byonic also has a box for Sodium because this is a common adduct on sialic acids.  Other glycan masses and modifications such as sulfation and acetylation can be defined with the Additional mass box; this mass is added to the mass of the monosaccharides.  For some helpful examples and best practices for conducting N-linked and O-linked glycan searches, see our Application Notes.


3.4 S-S, Xlink

The fourth tab is the S-S, Xlink tab, new in version 2.6.  This tab allows the user to search for disulfide-bonded peptide pairs, trisulfide-bonded (also called persulfide-bonded) pairs, and more general cross-linking.  It provides options to search for expected and unexpected disulfide bonds.  When the user checks the checkbox beside Disulfide on the left-hand side and designates, by number, which protein sequences from the FASTA database to consider (in the box below, outlined in red in Figure X), Byonic performs an in silico digestion based on the digestion parameters designated earlier, then tries to pair every cysteine-containing peptide with other cysteine-containing peptides.  The separators between numbers in the “For FASTA proteins” field indicate how Byonic should consider the potential pairings.  For example, “1” searches for crosslinks in the first protein only.  “4,5; 7” searches for all potential crosslinks on the 4th, 5th, and 7th proteins, where #4 and #5 may crosslink to each other, but not with #7.
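The grouping syntax of the “For FASTA proteins” field can be illustrated with a short Python sketch.  It encodes one reading of the two examples above, which is an assumption:  semicolons separate groups, commas separate protein numbers within a group, and links are allowed only within a group (including within a single protein).  The function name crosslink_pairs is hypothetical.

```python
from itertools import combinations_with_replacement

def crosslink_pairs(field):
    """Expand a 'For FASTA proteins' string into allowed protein pairs.

    Assumed reading of the examples in the text: semicolons separate
    groups, commas separate protein numbers within a group, and links
    are allowed only within a group (including within one protein).
    """
    pairs = set()
    for group in field.split(";"):
        numbers = sorted(int(n) for n in group.split(",") if n.strip())
        pairs.update(combinations_with_replacement(numbers, 2))
    return sorted(pairs)

print(crosslink_pairs("1"))        # [(1, 1)]
print(crosslink_pairs("4,5; 7"))   # [(4, 4), (4, 5), (5, 5), (7, 7)]
```

Note that no pair containing both protein 7 and protein 4 or 5 appears in the second result.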

The Trisulfide option allows a user to search for trisulfides both within a single peptide and linking two peptides.  Similarly, the Crosslink: DSS and Crosslink: Custom options allow a user to search for crosslinks both within a single peptide and linking two peptides.

The button “Generate details” allows a user to show all the linked peptides and mass deltas that Byonic will search for; to fine-tune the search, the user may edit the text.

Byonic_FigS-S


3.5 Advanced Options

Byonic_Fig5

Figure 5.  GUI view with the Advanced tab open.

Figure 5 shows the fifth tab, simply labeled Advanced.   This tab helps Byonic cope with imperfect inputs.   For example, on many MS instruments precursor ion charges are uncertain for some or all spectra.  By default, Byonic will use the assigned charge for all spectra with assigned charges and use +1, +2, +3 for all CID spectra and +2, +3, +4 for all ETD spectra without assigned charges.   The Apply charges box allows the user to override this default setting.  If the check box is checked, then Byonic will apply all comma-separated charges in the box to each spectrum (+2, +3, +4 in Figure 5).

Similarly, on many instruments the nominal precursor mass may actually be the mass of a 13C isotope peak rather than of the base (all 12C monoisotopic) peak, so that, for example, the true precursor mass will be within 10 ppm of 2350.120 Da or within 10 ppm of 2351.123 Da.   Precursor isotope off by x is a pull-down menu with three options:  No error check, which will use only the assigned precursor; Off by one or two, the default, which will allow the assigned precursor to be up to 2 Da too high; and Off by one or more, which will allow the assigned precursor to be up to n Da too high for a precursor of mass at least 1000n Da.
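The three menu options can be pictured as different lists of candidate monoisotopic masses for one assigned precursor.  The Python sketch below is an illustration under stated assumptions, not Byonic's algorithm:  the mode names are hypothetical labels for the menu entries, and each isotope step is taken to be the 13C minus 12C mass difference (about 1.00335 Da).

```python
C13_C12_DIFF = 1.0033548  # mass difference between 13C and 12C, in Da

def candidate_monoisotopic_masses(assigned_mass, mode="one_or_two"):
    """List masses to try for a possibly mis-assigned precursor.

    mode names are hypothetical labels for the three menu options:
    'none', 'one_or_two', and 'one_or_more' (up to n isotopes too
    high for a precursor of at least 1000*n Da).
    """
    if mode == "none":
        max_shift = 0
    elif mode == "one_or_two":
        max_shift = 2
    elif mode == "one_or_more":
        max_shift = int(assigned_mass // 1000)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [round(assigned_mass - k * C13_C12_DIFF, 4)
            for k in range(max_shift + 1)]

print(candidate_monoisotopic_masses(2351.123))
# [2351.123, 2350.1196, 2349.1163]
```

Each candidate would then be compared with computed peptide masses using the precursor mass tolerance; the second candidate here corresponds to the 2350.120 Da value in the example above.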

Byonic can also calculate the precursor and charge assignments directly from the MS1 data or use the originally assigned values.  Byonic is now also able to consider multiple precursors per scan.  The middle pane of the Advanced tab offers options for filtering the peptide-spectrum matches (PSMs) by score.  By default, Byonic defers PSM filtering until after protein ranking, and then filters to control PSM FDR on the “true” proteins—those ranked above the top-ranking decoy protein.  This method gains sensitivity while simultaneously reducing both protein and PSM FDRs.  (Two-dimensional target decoy strategy for shotgun proteomics, Journal of Proteome Research 10 (12), 5296-5301, 2011.)  To filter PSMs before protein ranking, the user can uncheck the Automatic score cut box and type in a minimum Byonic score.  For example, a score threshold of 200 will remove weak matches and a threshold of 400 will remove all but the best matches.  Filtering by score may be helpful in special cases, for example to eliminate from consideration all but the best wildcard PSMs.

In the rightmost pane of the Advanced tab there is a checkbox labeled “Create focused database”.  Checking this box directs Byonic to output a new FASTA file (labeled focused and appearing in the output objs directory) containing only the proteins found in the search, along with suitable decoys (>Reverse) for unbiased FDR estimation.  The focused database can then be used for subsequent wide searches, including more modifications and/or a wildcard.  Of course, the user can also create focused databases outside of Byonic by editing existing FASTA files.  The rightmost pane also gives the user control of the protein list cut-off.  By default, Byonic cuts the protein list at 1% protein FDR or 20 decoy proteins, whichever comes last, but the user can ask for 2% protein FDR or a completely unfiltered (but still ranked) protein list.

Due to Byonic’s many options and capabilities, writing modification rules and setting parameters can be a nontrivial task.  For this reason, Byonic allows the user to save all inputs using the button labeled Save parameters and load a previously saved input, which can then be edited, using the button labeled Load parameters.  Reset parameters blanks out the modification rules and restores defaults.


3.6 Progress

The final tab is the Progress tab, which is also new for version 2.6.  When the program is launched and running, this tab will show the progress bar of the current search.


4 Byonic Output

Byonic writes its output to a results file (filename extension .byrslt) that can then be viewed and explored interactively in Byonic’s output Viewer, a separate program from the Byonic search engine.  Byonic also writes the output data to an Excel spreadsheet (.xlsx) for viewing, sorting, importing into other programs, and sharing with collaborators.  In both output formats, Byonic organizes its findings into two lists, one for proteins and one for PSMs (peptide-spectrum matches).


4.1 Excel Output

Figure 6 shows Sheet 1 (Summary) of the Excel output, which gives critical statistics such as numbers of proteins, peptides, and spectra.  The two other Excel sheets give the Protein view and the Spectrum (or more precisely, PSM) view.  Although most of the information is the same between the Excel output and the Viewer, unique to the Excel spreadsheet is certain information on the Summary page, including some informative statistics, the search parameters, and two figures—Protein Score Plot and Precursor Mass Errors.

Byonic_Fig6

Figure 6.  Byonic output (Summary worksheet) as viewed in Excel.


4.2 Viewer Output

Figure 7 shows Byonic’s Viewer, which includes four interactive views:  (1) Protein List in the upper left pane, (2) Protein Coverage map for the selected protein in the lower left, (3) Peptide List (PSMs) for the selected protein(s), and (4) Annotated Spectrum for the selected PSM.  These views are interconnected:  changing the selected protein in view (1) changes views (2) and (3), and changing the selected PSM in (2) or (3) changes the annotated spectrum in view (4).  The user can select all proteins at once with the checkbox at the top left, next to Prot Rank; this then fills view (3) with all PSMs.  The four views are dockable and rearrangeable for a customized screen layout, which is especially useful for double-headed displays.  Figure 8 shows the Spectrum view undocked (detached from the other panes) and resized.  A selection under Window on the top bar restores the default layout.

The user can rearrange columns in the protein and peptide lists, as well as hide/show columns and adjust their widths for optimum viewing.  To hide/show columns, right-click on the headings bar to open the Header Editor.  Mousing over column headings and icons brings up Tool Tips.  The protein and peptide lists can be sorted by any column value by clicking on the column header, and the lists can be filtered using the text boxes on the top bars.  For example, to find all phosphopeptides, filter for peptides containing the string [+79.966].  The Viewer layout persists upon exit, and customized layouts can be saved (under Window) as small files with the suffix .ini.

The peptide list includes a large number of possible columns.  The most important ones are the peptide sequence, Byonic score and log prob, and scan number (most often found in the comment column).  In the case of an identification that includes a glycan from the predefined glycan tables, the glycan’s monosaccharide composition is given by a sequence of 6 numbers, the numbers of HexNAc, Hex, Fuc, NeuAc, NeuGc, and sodium.  The column header NHFAGNa serves as a mnemonic for the composition code.   The various numbers in the peptide list columns are explained in more detail below.

The annotated spectrum view allows the user to inspect identified peaks and associated fragment errors for manual validation of identifications and modification placements.  Byonic annotates the most commonly observed ion series (a, b, y, c, z, b++, y++, etc.) for the type of fragmentation, along with common neutral losses from the precursor (e.g., M–98 for loss of phosphoric acid), and charged losses from glycans.  One of Byonic’s annotations is non-standard:  ~y7 means y7 with loss of labile modifications (e.g., O-glycans).

The spectrum view includes a vertical-line cursor, which is movable by the mouse and allows the user to line up annotated peaks with their associated m/z errors.  When the cursor is positioned exactly over a spectrum peak, the reported m/z and intensity of that peak are shown inside curly braces {  ,  }.  When the cursor is not directly over a peak, these numbers give the m/z and intensity (x, y coordinates) of the cursor position.  The spectrum view also presents a cleavage diagram for the amino acid sequence, which is a standard summarization of the evidence for the identification.  There are buttons that turn on/off the peak and cleavage diagram annotations.  Finally, there are buttons for zooming in (magnifying glass with a +), zooming out (magnifying glass with a -), panning (rosette of compass directions), and a reset button (magnifying glass with 1:1).

 

Byonic_Fig7new

Figure 7.  Byonic output as viewed in Byonic’s interactive Viewer.

 

There is a toggle-button in the upper left corner of the Spectrum view in Figure 7.  When the Ion Table is shown, by default, the observed values are shown in RED.  The user can optionally expand to show observed and Delta masses in separate columns.  There is an export button that allows an analyst to export the table to the clipboard.

Byonic_Fig7b

Figure 8.  Byonic’s Viewer with the Spectrum view (MS2 plot and fragment errors) undocked.

 


4.3 Output Field Descriptions

The Excel spreadsheet and the interactive Viewer are two ways of exploring the same content.  The following descriptions apply to both forms of output.

Protein List.  Byonic outputs a protein list ranked by the protein p-value.  The protein p-value is the likelihood of the PSMs to this protein arising by random chance, according to a simple probabilistic model.  For example, a protein p-value of 10^-3 = 0.001, or one chance in a thousand, means that in a search against a database containing 10,000 independent proteins, we expect to see only about ten p-values better than 0.001 arising at random.  Byonic’s p-values are only as accurate as the probabilistic model, however, so the user should also check the ranking of the proteins relative to the decoy (>Reverse) proteins.  Byonic actually outputs the absolute value of the logarithm base 10 of the p-value.  Logarithms are useful to prevent numerical underflow, and the absolute value offers compatibility with follow-on tools that expect a larger-is-better score.  Byonic’s protein list includes a number of columns that reflect the quality of the PSMs for each protein:

|Log Prob| – Absolute value of the log base 10 of the protein p-value

Best |Log Prob| – Largest |Log Prob| of an individual PSM assigned to the protein

Best Score – Largest Byonic score of a PSM assigned to the protein.  For “one-peptide-per-protein” samples, for example top-down proteomics or N-glycopeptide samples, sorting by Best Score may give a better protein list (lower-ranking decoys) than protein |Log Prob|.

Total Intensity – Sum of all fragment peak intensities over all MS/MS spectra

# of spectra – Total number of PSMs, including duplicate PSMs

# of unique peptides – Total number of PSMs, discounting duplicates.  (The same modification differently placed counts as a distinct PSM.)

Coverage % – Percent of the protein sequence covered by PSMs
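As a small worked example of the |Log Prob| convention, the sketch below converts a p-value to its larger-is-better form and reproduces the "about ten in 10,000" arithmetic from the Protein List paragraph.  The function names are illustrative only and are not part of Byonic.

```python
import math


def abs_log_prob(p_value: float) -> float:
    """Return |log10(p)|, the larger-is-better form of a p-value."""
    return abs(math.log10(p_value))


def expected_random_hits(p_value: float, n_proteins: int) -> float:
    """Expected number of independent proteins reaching this p-value by chance."""
    return p_value * n_proteins


if __name__ == "__main__":
    print(abs_log_prob(0.001))                  # |Log Prob| for p = 10^-3
    print(expected_random_hits(0.001, 10_000))  # random hits expected in 10,000 proteins
```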

Byonic lumps proteins P1, P2, … into a protein group if exactly the same spectra match all the proteins in the group.  An “ambiguous” PSM, meaning one matching a peptide that is found in two or more proteins, is always assigned to the higher-ranking of two proteins it matches.  Thus, if P1 has separate evidence but P2 does not, P2 will not be shown, and if P1 has a lot of separate evidence but P2 has only a little, then P1 will be ranked according to all its evidence, but P2 will be ranked according to only its separate evidence.  For this reason, as well as many other reasons, none of Byonic’s outputs (# of spectra, total intensity, etc.) is an accurate measure of protein abundance.

Peptide-Spectrum Match List.  As initially presented, Byonic’s PSM list is organized by protein, and left to right (that is N- to C-terminus) by starting position within proteins.  The user can also sort the list by other columns, for example, in order to see how the scores of phosphopeptides or glycopeptides compare to the scores of decoy peptides, or to see how PSM scores vary across all proteins.  The PSM list includes the following columns, as well as some others that require no explanation:

Off-by-x Error – [MObserved – MComputed], where MObserved is the observed M+H (singly charged) precursor mass and MComputed is the computed M+H precursor mass, and [ ] means closest integer.

Mass Error (ppm) – 10^6 x (MObserved – MComputed) / MComputed.  The ppm mass error is computed after correcting for off-by-x errors.

Starting Position – Position within the protein of the N-terminal residue of the peptide.

Cleavage – Digestion specificity, where Specific means fully specific, Nragged means nonspecific at the N-terminus, Cragged means nonspecific at the C-terminus, and Non means nonspecific at both termini.

Score – Byonic score, the “raw” indicator of PSM correctness.  Byonic scores reflect the absolute quality of the peptide-spectrum match, not the relative quality compared to other candidate peptides.  Byonic scores range from 0 to about 1000, with 300 a good score, 400 a very good score, and PSMs with scores over 500 almost sure to be correct.

Delta – The drop in Byonic score from the top-scoring peptide to the next distinct peptide.  In this computation, the same peptide with different modifications is not considered distinct.

DeltaMod – The drop in Byonic score from the top-scoring peptide to the next peptide different in any way, including placement of modifications.  DeltaMod gives an indication of whether modifications are confidently localized; DeltaMod over 10.0 means that there is a high likelihood that all modification placements are correct.

|Log Prob| – The absolute value of the log10 of the posterior error probability (PEP).  The PEP takes into account the Byonic score, delta, precursor mass error, digestion specificity, and so forth (10 features in all).  For PSMs with non-negligible error probabilities, say error probabilities > 0.0001, and hence |Log Prob| < 4.0, PEPs are in good agreement with “local FDR” measured by the decoy sequences.  |Log Prob| incorporates more evidence than Byonic scores, so that sorting by |Log Prob| will almost always give a better ROC curve than sorting by score.

PEP 1D – PEP stands for “posterior error probability”. Byonic collects statistics for “true” and “false” PSMs: score, delta, precursor mass error, digestion specificity, and so forth (10 features in all). “True” PSMs are defined to be high-scoring PSMs from high-scoring proteins, and “false” PSMs are defined to be decoy PSMs. (This is “semi-supervised” statistical learning, because decoys accurately model false PSMs but high-scoring PSMs from high-scoring proteins are only an approximation to true PSMs.) PEP 1D is the probability that a PSM came from the “false” distribution rather than the “true”.

FDR 1D – Byonic ranks all PSMs, target and decoy, by increasing PEP 1D. FDR 1D for the PSM with rank n is the number of decoy PSMs with rank at most n divided by the number of target PSMs with rank at most n. (If FDR 1D ever goes above 1.0, we simply report 1.0.) Notice that FDR 1D is not monotonic with increasing PEP 1D; each decoy causes FDR 1D to jump up and then a string of targets causes FDR 1D to decline until the next decoy.

FDR 2D – One deficiency in PEP 1D and FDR 1D is that they do not take into account the protein of origin for a PSM. A low-scoring PSM to a top protein is more likely to be true than a low-scoring match to an arbitrary protein. Assuming that decoy PSMs come from reversed proteins, a two-dimensional ranking procedure can compute a “protein-sensitive” FDR. First cut the PSMs by protein of origin to an acceptable protein FDR; then add into the list of remaining PSMs all decoys coming from reverses of accepted proteins; rank the augmented PSM list by PEP 1D; and finally define FDR 2D for the PSM with rank n to be the number of decoy PSMs with rank at most n divided by the number of target PSMs with rank at most n.

PEP 2D – We can also compute a two-dimensional version of PEP by giving a bonus to PSMs from top proteins (proteins ranked above the top decoy protein). The chance that a random PSM hits such a protein is roughly the fraction of the protein database occupied by top proteins, so we simply multiply PEP 1D by this fraction to obtain PEP 2D. In order that the bonus not cut off abruptly at the top decoy protein, we fade out the bonus in a band around the top decoys.

FDR Unique 1D – Rank PSMs by PEP 1D. Remove all PSMs that are identical to higher-ranked PSMs to form a list of unique PSMs. FDR Unique 1D for the unique PSM in rank n (and for all its now-removed duplicates) is the number of decoy unique PSMs ranked at most n divided by the number of unique target PSMs ranked at most n.

FDR Unique 2D – Rank PSMs by PEP 2D and then follow the procedure for FDR Unique 1D. FDR Unique 1D and 2D are better measures of false discovery rate than FDR 1D and 2D; duplicates should be discounted because 100 duplicates are only a single “discovery”.

q-value 1D – Target/decoy-based FDR suffers from two small flaws: (1) a PSM can have lower PEP but higher FDR than another PSM, and (2) the FDRs of PSMs ranked above the top decoy are all zero, even though these PSMs may vary from perfect matches down to merely okay matches. q-value 1D is a version of FDR 1D that corrects both flaws but may lose some accuracy at higher FDR values due to the monotonicity requirement.

q-value 2D – A version of FDR 2D that corrects the flaws mentioned above.

# of unique peptides – Total number of PSMs for the protein that “owns” this PSM, discounting exact duplicates.  The same peptide with the same modification differently placed counts as a distinct PSM.

Comment – The comment= value is from the TITLE= line in the .mgf.  Depending upon what software wrote the .mgf, the TITLE= line sometimes contains the scan number from the raw file.

Scan # – This is the scan number for the MS/MS spectrum.

Scan time – This is the time of the scan, usually reported in seconds (depending on the input data file format).  This is usually also considered the retention time (R.T.) in LC-MS experiments.
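The Off-by-x Error and Mass Error (ppm) columns can be illustrated with a short calculation.  This is only a sketch: the manual does not say how the off-by-x correction is applied, so the code assumes the observed mass is shifted by x times an isotope spacing of 1.003355 Da (the 13C – 12C mass difference).  The function names and the isotope_spacing parameter are hypothetical.

```python
def off_by_x(m_observed: float, m_computed: float) -> int:
    """Closest integer to (MObserved - MComputed), both as M+H masses."""
    return round(m_observed - m_computed)


def mass_error_ppm(m_observed: float, m_computed: float,
                   isotope_spacing: float = 1.003355) -> float:
    """ppm error after removing the off-by-x offset.

    Assumption: the correction subtracts x * isotope_spacing from the
    observed mass; the manual does not spell out this step.
    """
    x = off_by_x(m_observed, m_computed)
    corrected = m_observed - x * isotope_spacing
    return 1e6 * (corrected - m_computed) / m_computed


if __name__ == "__main__":
    # A precursor picked one isotope peak too high.
    print(off_by_x(1501.0109, 1500.0))
    print(round(mass_error_ppm(1501.0109, 1500.0), 2))
```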


5 More on Modifications

Here we provide more details about Byonic’s modification rules.

Known Masses.  As a convenience for custom (typed-in) modification rules, Byonic maps certain integer masses to commonly used exact masses.  For example, Byonic automatically maps +16 to +15.994915, the monoisotopic mass of oxygen.   Byonic maps the following (6 digits internally, but only 3 digits shown here):  1 @ 0.984, –1 @ –0.984, 14 @ 14.016, 16 @ 15.995, –17 @ –17.027,  –18 @ –18.011, 22 @ 21.982, 28 @ 28.031, 32 @ 31.990, 42 @ 42.011, 43 @ 43.006, 48 @ 47.985, 57 @ 57.021, 58 @ 58.005, 80 @ 79.966.   As further shorthand, the + before a positive mass delta is unnecessary, so that S[+80] and S[80] are both acceptable.
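The integer-to-exact-mass shorthand can be pictured as a simple lookup.  The sketch below uses the 3-digit values listed above (Byonic keeps 6 digits internally) and a hypothetical resolve() helper that accepts old-syntax tokens such as S[+80] or S[80].  It illustrates the idea and is not Byonic’s own parser.

```python
import re

# Integer shorthand -> mass, using the 3-digit values shown in the text.
KNOWN_MASSES = {
    1: 0.984, -1: -0.984, 14: 14.016, 16: 15.995, -17: -17.027,
    -18: -18.011, 22: 21.982, 28: 28.031, 32: 31.990, 42: 42.011,
    43: 43.006, 48: 47.985, 57: 57.021, 58: 58.005, 80: 79.966,
}

_TOKEN = re.compile(r"^([A-Z])\[([+-]?\d+(?:\.\d+)?)\]$")


def resolve(token: str):
    """Return (residue, mass delta) for a token such as 'S[+80]'."""
    match = _TOKEN.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognized token: {token!r}")
    residue, text = match.groups()
    # Only bare integers are treated as shorthand; typed decimals pass through.
    if "." not in text and int(text) in KNOWN_MASSES:
        return residue, KNOWN_MASSES[int(text)]
    return residue, float(text)


if __name__ == "__main__":
    print(resolve("S[+80]"))
    print(resolve("S[80]"))
```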

Bundled Modifications.  A modification rule can specify more than one target residue at a time.  For example, the one rule

Phospho / +79.966331 @ S, T | common2

is identical to the two rules

Phospho / +79.966331 @ S | common2

Phospho / +79.966331 @ T | common2

Using the old syntax, one would write  [ST][79.9663], common2.  Notice that the new syntax requires a comma separating S and T, but the old  syntax requires no comma.
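To make the equivalence concrete, the following sketch expands a bundled new-syntax rule into one rule per target.  The expand_rule() helper is hypothetical and only handles the "Name / mass @ targets | attributes" layout shown above.

```python
def expand_rule(rule: str) -> list:
    """Split a bundled rule into one rule per comma-separated target."""
    head, _, attributes = rule.partition("|")
    spec, _, targets = head.partition("@")
    return [
        f"{spec.strip()} @ {target.strip()} | {attributes.strip()}"
        for target in targets.split(",")
    ]


if __name__ == "__main__":
    for line in expand_rule("Phospho / +79.966331 @ S, T | common2"):
        print(line)
    # Phospho / +79.966331 @ S | common2
    # Phospho / +79.966331 @ T | common2
```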

Protein Terminal Modifications.  The following rule specifies acetylation on the protein, but not peptide, N-terminus:

Acetyl / +42.010565 @ Protein NTerm | rare1

In the old syntax, we write  N-terminal [42.011], rare | ProteinTerminalMod

Protein-Specific Modifications.  The following rule specifies hydroxyproline only on proteins that include the string “collagen” in their FASTA file names.

Oxidation / +15.994915 @ P | common3 | ProteinLabel[icase]{collagen}

The last field is an example of a modification attribute; to add an attribute pull down the menu circled in red in Figure 3 above.  In the old syntax, we write

P[15.995], common3 | ProteinLabel[icase]{collagen}

The keyword [icase] specifies case-insensitive match, so that collagen will also match “Collagen”, “procollagen”, “collagenase”, etc.  The keyword [case] specifies case-sensitive match.  To match a single protein, use a unique identifier such as the accession number.  Protein-specific modifications offer even finer control of the size and focus of a Byonic search and can be extremely useful when the search space would otherwise be unnecessarily large, as mentioned below.
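The matching behavior described for ProteinLabel amounts to a substring test.  The sketch below assumes a plain substring comparison against the FASTA name, which is consistent with the examples given here but is not confirmed as Byonic’s exact implementation.  The function name is illustrative.

```python
def protein_label_matches(label: str, fasta_name: str, icase: bool = True) -> bool:
    """True if label occurs in the FASTA name ([icase] or [case] semantics)."""
    if icase:
        return label.lower() in fasta_name.lower()
    return label in fasta_name


if __name__ == "__main__":
    names = ["Procollagen alpha-1(I)", "COLLAGENASE 3", "Trypsin"]
    print([n for n in names if protein_label_matches("collagen", n)])
```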

Glycopeptide searches.  The fully automatic, pre-set, glycopeptide searches (the checkboxes under Glycan preset tables) allow only one glycan per peptide.  The limitation of one glycan per peptide is not a severe restriction for N-linked glycosylation, because few peptides contain two N-glycosylation motifs, that is, two occurrences of NX{S/T}.  This limitation is, however, a serious restriction for O-linked glycosylation, especially mucin glycosylation, and for O-GlcNAc-ylation on S/T, because sites for these modifications often cluster together.  The following rule gives a reasonable O-GlcNAc search.

HexNAc / +203.079373 @ S, T | common3

Alternative forms for the same rule are HexNAc @ S, T | common3 (letting the software compute the mass) and HexNAc @ OGlycan | common3.  For a faster but narrower search, use HexNAc @ S, T | common2, or for an intermediate search, designate the modification as both common2 and rare1.  Here is a slightly expanded search rule:

HexNAc / +203.079373 @ NGlycan, S, T | common3

NGlycan as a modification target means the asparagine in the N-glycosylation motif NX{S/T}, where X is any residue except proline.  The expanded rule searches for truncated N-glycans; with the earlier rule a peptide with a truncated N-glycan is likely to be mistaken for an O-GlcNAc peptide.  With the old syntax, one would use two rules, [ST][+203.079], common3 and NGlycan[+203.079], common3, for the expanded rule above.
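The NX{S/T} motif is easy to locate with a regular expression.  The sketch below returns the 1-based positions of asparagines that satisfy the motif within a peptide or protein sequence.  It illustrates the motif definition only, and the helper name is hypothetical.

```python
import re

# N followed by any residue except P, then S or T.  The lookahead keeps
# the match one residue wide so that adjacent motifs are all reported.
_SEQUON = re.compile(r"N(?=[^P][ST])")


def nglycan_sites(sequence: str) -> list:
    """1-based positions of N in N-X-S/T motifs, with X not proline."""
    return [match.start() + 1 for match in _SEQUON.finditer(sequence)]


if __name__ == "__main__":
    print(nglycan_sites("AANGSK"))  # [3]
    print(nglycan_sites("ANPSK"))   # []
```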

The NGlycan keyword enables the user to search for more than one N-glycan per peptide.   The keyword also enables the user to customize the N-glycosylation search to specific glycan masses, rather than rely on Byonic’s predefined tables.  For example, a researcher may have already acquired detailed knowledge of the glycan masses through detached glycan analysis.  For human blood serum samples, the most common N-glycan is usually HexNAc(4)Hex(5)NeuAc(2) at 2204.77 Da.
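The quoted mass of 2204.77 Da can be checked by summing residue masses.  In the sketch below, the HexNAc and Hex values are the ones used in the rules in this section.  The NeuAc residue mass (291.095417 Da) does not appear in this manual and is supplied here as the standard monoisotopic value.  The function is illustrative only.

```python
import re

# Monoisotopic residue masses in Da.  NeuAc is not listed in the manual.
RESIDUE_MASS = {
    "HexNAc": 203.079373,
    "Hex": 162.052824,
    "NeuAc": 291.095417,
}

_TERM = re.compile(r"([A-Za-z]+)\((\d+)\)")


def glycan_mass(composition: str) -> float:
    """Sum residue masses for a composition such as 'HexNAc(4)Hex(5)NeuAc(2)'."""
    return sum(RESIDUE_MASS[name] * int(count)
               for name, count in _TERM.findall(composition))


if __name__ == "__main__":
    print(round(glycan_mass("HexNAc(4)Hex(5)NeuAc(2)"), 2))  # 2204.77
```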

Mucin-type O-glycosylation sites tend to cluster, so that a single tryptic peptide may have 5 or more glycosylation sites.  Searches allowing 4 or more common modifications are computationally intensive, so these searches are best run with small protein databases and limited lists of glycan masses.  With Total common modification max set to 5, a search involving 10 proteins (typically the protein of interest, along with trypsin and contaminants) and 10 likely O-glycans (for example, HexNAc(1), HexNAc(1)Hex(1), HexNAc(1)Hex(1)NeuAc(1), HexNAc(1)Hex(1)NeuAc(2), HexNAc(2)Hex(1), etc.) may already be an undesirably large search.  Making the glycosylation protein-specific using the ProteinLabel attribute is one way to help control the combinatorial explosion.
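A rough count shows why such searches grow so quickly.  The sketch below counts the modified forms of a single peptide, assuming each candidate site carries at most one glycan and up to max_mods sites are occupied.  This is a simplified model for illustration and does not describe how Byonic enumerates candidates.

```python
from math import comb


def modified_forms(sites: int, mod_types: int, max_mods: int) -> int:
    """Number of ways to place at most max_mods modifications on a peptide.

    Choose k of the candidate sites, then one of mod_types for each site.
    """
    return sum(comb(sites, k) * mod_types ** k
               for k in range(min(sites, max_mods) + 1))


if __name__ == "__main__":
    # One peptide with 8 S/T sites and 10 candidate O-glycans.
    for max_mods in (1, 2, 3, 4, 5):
        print(max_mods, modified_forms(8, 10, max_mods))
```

Under these assumptions, raising the maximum from 2 to 5 takes one peptide from 2,881 forms to over six million.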

Byonic allows the user to combine handcrafted modification rules with one or both of the predefined glycopeptide searches.  One strategy is to use the predefined searches for exploration, and then iteratively move the modification masses found in confident identifications into handcrafted rules in order to focus the range of modification masses while simultaneously increasing the number of allowed instances of each modification.

Wildcard mass defect.  If the precursor mass tolerance is tight (less than 100 ppm), Byonic obtains the exact mass of the wildcard from the difference between the observed precursor mass and the calculated mass of the candidate peptide.  If the precursor mass tolerance is loose (≥ 100 ppm), Byonic uses a mass defect (fractional part of the mass) characteristic of an organic molecule, specifically 0.05 per 100 Da.  In either case, Byonic will show the wildcard mass it used in the output.  With precursor masses good to 5 ppm or less, a modest-size wildcard, say in the range –40 Da to 40 Da, may pinpoint the elemental composition.
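The two cases can be sketched as a single function.  The 100 ppm threshold and the 0.05 per 100 Da defect come from the paragraph above.  The function signature, and the idea of passing in an integer nominal_delta for the loose-tolerance case, are assumptions made for illustration.

```python
MASS_DEFECT_PER_DA = 0.05 / 100.0  # 0.05 Da of defect per 100 Da of mass


def wildcard_mass(observed_precursor: float, candidate_mass: float,
                  nominal_delta: int, tolerance_ppm: float) -> float:
    """Wildcard mass under tight or loose precursor tolerance."""
    if tolerance_ppm < 100.0:
        # Tight tolerance: take the mass directly from the precursor.
        return observed_precursor - candidate_mass
    # Loose tolerance: integer mass plus a typical organic mass defect.
    return nominal_delta * (1.0 + MASS_DEFECT_PER_DA)


if __name__ == "__main__":
    print(wildcard_mass(1015.9949, 1000.0, 16, tolerance_ppm=10))
    print(wildcard_mass(1016.2, 1000.0, 16, tolerance_ppm=500))
```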


6 False Discovery Rate

Internally Byonic retains all PSMs and all proteins.  For presentation to the user, however, Byonic by default cuts the protein list after the 20th decoy protein or at the point in the list at which the protein FDR first reaches 1%, whichever cut gives more proteins.  On most data sets, almost all the true proteins will be on this truncated list.  Researchers with special knowledge of their samples, along with the time to examine protein identifications manually, may choose to make a lower protein cut using the options on the Advanced tab of the Byonic input window.
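The default protein cut can be sketched as follows.  The input is the ranked protein list, best first, with True marking a decoy.  Two details are assumptions, since the manual does not state them: protein FDR is estimated as decoys divided by targets up to the current rank, and each cut includes the protein at which its threshold is reached.  If a threshold is never reached, that cut is taken to be the whole list.

```python
def protein_cut(is_decoy: list, max_decoys: int = 20,
                fdr_limit: float = 0.01) -> int:
    """Number of top-ranked proteins kept by the default cut (a sketch)."""
    decoys = targets = 0
    cut_at_decoy = cut_at_fdr = None
    for rank, decoy in enumerate(is_decoy, start=1):
        if decoy:
            decoys += 1
        else:
            targets += 1
        if cut_at_decoy is None and decoys == max_decoys:
            cut_at_decoy = rank
        # Assumed FDR estimate: decoys / targets seen so far.
        if cut_at_fdr is None and decoys / max(targets, 1) >= fdr_limit:
            cut_at_fdr = rank
    total = len(is_decoy)
    # "Whichever cut gives more proteins."
    return max(cut_at_decoy or total, cut_at_fdr or total)


if __name__ == "__main__":
    ranked = [False] * 50 + [True] * 25 + [False] * 5
    print(protein_cut(ranked))  # 70: FDR reaches 1% at rank 51, 20th decoy at rank 70
```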

After making the protein cut, Byonic makes a PSM cut in order to keep the spectrum-level FDR to a reasonable level, that is, in order to discard false matches to the reported proteins.  (See “Two-dimensional target decoy strategy for shotgun proteomics,” by M. Bern and Y. Kil in J. Proteome Res., 2011, vol 10(12), 5296-5301. PMID 22010998.) By default Byonic makes the PSM cut by discarding the n PSMs with the lowest Byonic scores, where n is the expected number of random PSMs matching the reported proteins.  Byonic then estimates the spectrum-level FDR of the remaining PSMs to the reported proteins; this FDR will typically be in the range 0 – 5%.  Alternatively, using the Advanced tab, the user can ask Byonic to make a manual score cut in order to obtain a desired spectrum-level FDR.  Warning:  A significant number of true PSMs may be lost by imposing an arbitrary low FDR limit rather than accepting Byonic’s automatic cutoff.

And a final word of caution:  The target-decoy strategy gives an accurate estimate of the rate of completely false identifications at both the protein and PSM levels, but it does not give any estimate of “partially correct” identifications, for example, wrong homologues or splice variants in the case of proteins, or misplaced or incorrect modifications in the case of peptides.  Peptide identifications that may be only partially correct can usually be recognized by low DeltaMod (often zero), and manual inspection of the annotated spectrum may in some cases resolve the ambiguity.  Even with all of Byonic’s advances over the previous state of the art, human judgment remains the ultimate arbiter of subtle identification clues.


7 Appendix – Common Modifications

Below are examples of often encountered modifications and the appropriate syntax for including those modifications in a Byonic search.

 

Table 1.  Cysteine Treatments + Artifacts of Treatment

The modifications in blue apply to samples with the cysteine treatment in black above them.  The syntax for the rules from the pull-down menu is also accepted in the custom modification entry box as well as the “old syntax” pre-dating the new pull-down menu interface.

Rule from Pull-Down Menu                                Explanation
Carbamidomethyl / +57.021464 @ C | fixed                Iodoacetamide treatment
(De)Carbamidomethyl / -57.021464 @ C | common1          Under-alkylation
Ammonia-loss / -17.026549 @ NTerm C | rare1             Pyro-glu from camC
Methyl / +14.01565 @ C | rare1                          Propionamide (from gel)
DTT / +151.996571 @ C | rare1                           Total mass delta 209 Da
Carbamidomethyl / +57.021464 @ NTerm, H, K | common2    Over-alkylation
Carboxymethyl / +58.005479 @ C | fixed                  Iodoacetic acid treatment
DTT / +151.996571 @ C | rare1                           Total mass delta 210 Da
Methylthio / +45.987721 @ C | fixed                     MMTS treatment
Propionamide / +71.037114 @ C | common2                 Common in gel samples

 

Table 2.  Other Chemical Treatments

Mixtures of isotopically labeled and unlabeled peptides (e.g., SILAC) are best run as two searches with fixed modifications, rather than as variable modifications, because peptides should be either completely labeled or completely unlabeled.

Rule from Pull-Down Menu                        Explanation
Propionyl / +56.026215 @ NTerm, K | fixed       Propionylation
TMT / +224.152478 @ NTerm, K | fixed            TMT0 labeling
TMT2plex / +225.155833 @ NTerm, K | fixed       TMT2 labeling
iTRAQ4plex / +144.102063 @ NTerm, K | fixed     iTRAQ labeling
Label:13C(6)15N(2) / +8.014199 @ K | fixed      Heavy Lysine for SILAC
Label:13C(8)15N(2) / +10.020909 @ R | fixed     Heavy Arginine for SILAC

 

Table 3.  Common In Vitro Modifications

Rule from Pull-Down Menu
Deamidated / +0.984016 @ N, Q | common2
Carbamyl / +43.005814 @ NTerm | common1
Gln->pyro-Glu / -17.026549 @ NTerm Q | common1
Glu->pyro-Glu / -18.010565 @ NTerm E | common1
Delta:H(2)C(2) / +26.01565 @ NTerm | common1
Oxidation / +15.994915 @ M | common2
Dioxidation / +31.989829 @ M, W | rare1
Trioxidation / +47.984744 @ C | rare1
Dethiomethyl / -48.003371 @ M | common2
Cation:Na / +21.981943 @ D, E | common1
Methyl / +14.01565 @ E | common1
Dehydrated / -18.010565 @ S, T | rare1

 

Table 4.  Common In Vivo (Biological) Modifications

Rule from Pull-Down Menu
Oxidation / +15.994915 @ P | common3 | ProteinLabel[icase]{collagen}
Phospho / +79.966331 @ S, T | common3
Phospho / +79.966331 @ Y | common2
Methyl / +14.01565 @ K, R | common2
Dimethyl / +28.0313 @ K | common2
Acetyl / +42.010565 @ K | common2
Trimethyl / +42.04695 @ K | common2
Acetyl / +42.010565 @ Protein NTerm | common1
Amidated / -0.984016 @ Protein CTerm | rare1
Sulfo / +79.956815 @ C, S, T, Y | common2
HexNAc / +203.079373 @ S, T | common2
HexNAc(1)Hex(1) @ OGlycan | common2
HexNAc(4)Hex(5)NeuAc(2) @ NGlycan | rare1
HexNAc(4)Hex(5)NeuAc(2)Sodium(1) @ NGlycan | rare1
Hex / +162.052824 @ K | common2

Is Byonic a search engine like Mascot, SEQUEST, etc.?
Yes, Byonic can do anything Mascot, SEQUEST, etc. can do, and more besides.

 

Why is Byonic better?
Byonic extends the state-of-the-art in two ways: its scorer is more accurate and allows a much wider range of search possibilities. The “scorer” is the algorithm that matches peptides to mass spectra by comparing the predicted fragmentation of the peptide with the observed peaks in the spectra, in consideration of the precursor mass accuracy and precision. Byonic incorporates a substantial amount of chemical knowledge into its fragmentation prediction, such as reduced CID fragmentation on the C-terminal side of proline and strong neutral losses from certain modifications. This expert system knowledge means that Byonic is more sensitive (more true positives) and specific (fewer false positives) than other search engines for exactly the same search as objectively measured using a decoy database strategy.

Moreover, Byonic enables a wider variety of searches than the other search engines. With Modification Fine Control™, Byonic can search for 10s or even 100s of modification types simultaneously without a prohibitively large combinatorial explosion. Byonic’s Wildcard Search™ allows the user to search for unanticipated or even unknown modifications alongside known modifications. Finally, Byonic’s glycosylation search allows the user to identify glycopeptides without prior knowledge of glycan masses or glycosylation sites.

 

How much better is Byonic?
The performance advantage depends upon the data, the search, and the skill of the user. On exactly the same search, Byonic’s advantage ranges from about 20% to 300% more spectra identified at the same False Discovery Rate (FDR), with the advantage increasing with the difficulty of the search. Easy searches are ones with small search spaces (such as due to a small protein database or a small number of modification types) and difficult searches are ones with large search spaces (such as due to nonspecific digestion, a large number of modification types, and low precursor mass accuracy).

 

What’s the difference between Byonic and Preview?
Preview offers an initial peek at your data to help you set the parameters for a much more sensitive Byonic search. Preview advises the user on mass accuracy, digestion specificity, and the prevalence of about 60 common modifications. Preview optionally recalibrates the m/z measurements to improve sensitivity/specificity for subsequent searches (using any search engine).

 

How should I prepare my data for Byonic?
Nothing is required.  Byonic reads raw data directly in the vendor formats of Thermo, Waters, Sciex, Bruker, and Agilent.  Optionally, you may want to run the data through Preview, and if the scatter plots of mass error vs. m/z for precursor and fragment mass errors reveal systematic m/z measurement errors, ask Preview to recalibrate the data and return a new m/z-recalibrated MGF file.

 

Should I de-isotope my data?
No! Byonic handles isotope peaks internally. De-isotoping the spectra beforehand using, for example, Mascot Distiller, destroys valuable information. De-isotoping is an especially bad idea for ETD spectra, which often have peaks (c–1 and z+1 peaks) that lead de-isotoping algorithms astray.

 

How should I choose search parameters?
Set mass tolerances appropriate for the type of instrument, for example, 10 ppm precursor tolerance for a high resolution instrument and 0.3 Dalton fragment tolerance for ion trap fragmentation. Preview’s mass error plots can help you choose these tolerances. Preview’s m/z recalibration can remove systematic errors so that data can be run with tighter tolerances, for example, 5 ppm instead of 10 ppm tolerance for a high resolution instrument. Tight tolerances offer significant advantage for difficult searches, for example, resolving nearly isobaric modifications such as sulfation and phosphorylation, or identifying glycopeptides with poor fragmentation. Tolerances can be set in either Da or ppm, as appropriate for the instrument.
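To see why tight tolerances matter for nearly isobaric modifications, it helps to convert between Da and ppm.  The sketch below uses the Phospho and Sulfo masses from the modification tables in this manual.  The 2000 Da peptide mass is an arbitrary example value, and the helper names are illustrative.

```python
PHOSPHO = 79.966331  # Da, from the Phospho rule
SULFO = 79.956815    # Da, from the Sulfo rule


def ppm_to_da(ppm: float, mass: float) -> float:
    """Absolute tolerance in Da for a ppm tolerance at a given mass."""
    return mass * ppm * 1e-6


def da_to_ppm(delta: float, mass: float) -> float:
    """Relative size in ppm of a mass difference at a given mass."""
    return 1e6 * delta / mass


if __name__ == "__main__":
    gap = PHOSPHO - SULFO
    print(round(gap, 6))                      # mass gap in Da
    print(round(da_to_ppm(gap, 2000.0), 2))   # the same gap in ppm at 2000 Da
    print(ppm_to_da(10, 2000.0))              # a 10 ppm tolerance at 2000 Da, in Da
```

At 2000 Da the two modifications differ by under 5 ppm, so a 10 ppm precursor tolerance (0.02 Da) cannot separate them on precursor mass alone.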

Set digestion specificity based on the prevalence of nonspecific digestion and the complexity of the search. If the modification complexity of the search is high, as in wildcard, glycosylation, or oxidative footprinting searches, it is best to avoid the extra complexity of searching for nonspecific digestion, unless the nonspecific digestion rate is high (say, over 20% of all peptides).

Set modifications based upon prevalence reported by Preview and the goal of the study. If the goal of the study is phosphorylation site identification, enable up to 3 or even 4 phosphorylation sites per peptide, and avoid other modifications unless they are prevalent. If the goal of the study is simply protein identification, it is best to enable only the most common modifications (for example, oxidized methionine and deamidation). Be especially alert to over-alkylation; in some samples, over-alkylation is so common that the majority of peptides carry iodoacetamide artifacts. Some modifications are more costly (for example, sodiation on any residue as opposed to just E and D), but others (such as pyro-glu on N-terminal glutamine) barely increase the size of the search space.

 

What is a focused database?
Byonic enables the user to output a small protein database, containing only the higher-ranking proteins (whether they be target or decoy) from a first search, along with appropriate decoys. A database specifically focused on the proteins in the sample under study improves speed and accuracy for subsequent searches. Focused databases are especially useful for wide modification searches, such as glycosylation and wildcard searches. We do not, however, advocate the use of a focused database for every study; they are unnecessary for most searches. Similarly, there is no need to make the first search very narrow; enable common modifications and nonspecific digestion as appropriate.

 

What is a wildcard modification?
A wildcard modification is nonspecific in both mass and (optionally) residue type; this is Byonic’s version of blind modification search. Adding a wildcard with a mass range of 100 Daltons increases the search time approximately 100-fold, so it is best to add a wildcard only to small searches, for example, fully tryptic searches with few other modifications. A wildcard can also be restricted to specific residues, which reduces the cost.

 

When should I use a wildcard modification?
We use a wildcard modification most often in a final clean-up search to be sure we haven’t missed anything interesting. We take out most other modifications and search against a focused database so that the wildcard search does not take too long. Wildcard search can also be used to find sequence variants. On the other hand, wildcard matches tend to be approximate, rather than exact: the wildcard modification is often misplaced, off by one Dalton, or the sum of two closely spaced modifications. Be alert for approximate answers such as EV[–18]PQLEVTK, where –18 almost surely belongs on E not V, and L[–113]EDEFVEVTK, where the right answer is surely EDEFVEVTK. Finally, we use wildcard search to solve mystery spectra; on a well-sequenced organism almost all spectra are at most one wildcard away from a database sequence.

 

Should I search my data more than once?
Yes! It can be a good idea to bracket your search with several settings of the crucial parameters when trying to get the most from the data. Even on data with overall 5 ppm precursor mass accuracy, there will be a few valid identifications with much larger errors, due to interfering MS1 peaks, mixture spectra, and so forth.

 

Should I combine multiple search engines?
In our experiments, other search engines find very few valid spectrum identifications missed by Byonic, typically less than 1%.

 

How does Byonic compute p-values?
Byonic computes both peptide-spectrum match (PSM) and protein p-values, assuming simple models of random matches. For PSMs, Byonic assumes that random scores are independent identically distributed (i.i.d.) picks from a probability distribution with an exponential right-hand tail. This distribution depends only upon the size of the search (number of modifications, digestion specificity, size of the protein database, and so forth), and not upon the spectrum itself. Byonic reports the log base 10 of the p-value, so that a LogProb (log p-value) of –2.0 should occur by chance on only about one out of 100 spectra.

For proteins, Byonic computes the expected total LogProb of PSMs hitting each protein, assuming that random PSMs are distributed uniformly over the protein database. The protein LogProb is the excess of total LogProb over the expected amount (that is, how much more negative). Proteins are ranked from most confident on down according to LogProb.

 

How does Byonic estimate the False Discovery Rate (FDR)?
The False Discovery Rate (FDR) in a list of identifications (either proteins or PSMs) is the number of incorrect identifications divided by the total number of identifications. Byonic estimates PSM FDR using the target/decoy approach, which is the de facto standard for significance testing in proteomics. We have devised a method called two-dimensional FDR (2D FDR) http://www.ncbi.nlm.nih.gov/pubmed/22010998 that can take into account protein-level information when computing PSM FDR, without biasing the FDR estimate. Two-dimensional FDR gives greater sensitivity/specificity than other methods because it can retain lower scoring PSMs to high-ranking proteins (which are likely to be correct) yet discard higher scoring PSMs to low-ranking proteins (which are likely to be incorrect).
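The target/decoy estimate can be sketched in a few lines.  The code below follows the FDR 1D definition given in the output field descriptions (decoys with rank at most n divided by targets with rank at most n, capped at 1.0).  The q-value step shown is the standard running-minimum construction, which removes the non-monotonic jumps; it is substituted here for illustration, since the manual does not describe how Byonic computes its q-values.

```python
def running_fdr(is_decoy: list) -> list:
    """FDR at each rank for PSMs sorted best first (True marks a decoy)."""
    fdrs, decoys, targets = [], 0, 0
    for decoy in is_decoy:
        if decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))
    return fdrs


def q_values(fdrs: list) -> list:
    """Smallest FDR at this rank or any later rank (monotone version)."""
    result, best = [], 1.0
    for fdr in reversed(fdrs):
        best = min(best, fdr)
        result.append(best)
    return result[::-1]


if __name__ == "__main__":
    ranked = [False, False, True, False, False]
    fdrs = running_fdr(ranked)
    print(fdrs)            # jumps up at the decoy, then declines
    print(q_values(fdrs))  # never decreases with rank
```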

The following installation packages integrate Preview/Byonic into Thermo Fisher Scientific’s Proteome Discoverer software as nodes. There is no cost for the nodes, but running searches using the nodes requires that the standalone versions of Preview/Byonic be installed and licensed.


1. Quick Tutorial

Preview Dialog Quick Tutorial

1. First, please install and register to activate Preview.

2. Provide (1) the spectrum file in .mzML, mzXML, Thermo RAW, or .mgf format, and (2) the database in FASTA format. Then you simply press the Run button (3). Once the process finishes, a browser window will open to display the result. Note that you will be asked to select a folder to place the resulting files (which you may do later).

3. You may optionally provide (a) fixed modification information and (b) options such as digestions, fragmentation, and whether to produce a recalibrated file (output file is in .mgf format currently, and written in the results folder). To learn about these options in detail, read below and also the tool tips (by hovering the mouse cursor over the text within the application’s UI), or see the help page.

 

 

 

2. Guided Tutorial: Input settings

We will guide you through a sample run and how to use the results. When you launch Preview, you will see the user interface.

Required input

There are only two required inputs: (1) a set of MS/MS spectra in .mzML, .mzXML, Thermo RAW, or .mgf format, and (2) a protein database in FASTA format. Press the “Select MS/MS data file” and “Select Protein database file” buttons to browse for your files.

Fixed Modifications

You can define fixed modifications for your sample. Common modification presets can be found by clicking on the pulldown menu. In this case, the user specified a fixed modification of +57.0215 Da on cysteine. For standard cysteine treatments (+0, +46, +57, +58, and +71), this input is not strictly necessary, because Preview can usually deduce the cysteine treatment from the data.

Preview does not attempt to deduce other fixed modifications, so the user must specify lysine and N-terminal modifications (such as TMT, iTRAQ, and SILAC) and C-terminal modifications (such as 18O labeling). Note that you can type in modification masses that are not already in the pull-down menu to customize the list according to your needs.

Digestion

In this example, the user left the settings of digestion cleavages and initial search specificity at their defaults: RK (meaning trypsin digestion) and fully specific initial search. We recommend fully specific initial search for all digested samples; nonspecific initial search may perform better for undigested (peptidomic) samples.

Search options

The “Phospho enriched” option is used only for cases where the sample is composed predominantly of phosphopeptides; this option optimizes Preview for this situation. The “Wildcard search” option directs the program to perform a blind modification search that tries each integer mass shift from -50 to +150 on each residue. The “Try all charge assignments” option directs Preview to ignore the charge assignments in the spectrum file and run every spectrum with z = +1, +2, +3 for each CID spectrum, and z = +2, +3, +4 for each ETD spectrum. You can run Preview twice, once with this box checked and once without, to test the reliability of charge assignments. For the “Fragmentation type,” you can choose between CID/HCD or ETD/ECD.

Recalibration: In the above example, the user has chosen the default option, which is to recalibrate both the precursor and the fragment masses. This will generate a new file with “.recal.” inserted in the .mgf file name. The original file is not altered. The recalibrated file is located in the results output folder, which will be inside the folder set from the menu: Edit > Preferences.

3. Guided Tutorial: Understanding the Results

Preview runs quickly (in this case, about 15 seconds) and then produces its output. The primary output is two HTML pages: a Summary page and a Details page. In addition, Preview writes a Byonic parameter file, with the extension .byparms, to the output folder for use in a subsequent search. Note that this is a suggested search. In general, the parameters should be reviewed and, if necessary, modified to fit the particular experiment. Preview is good at finding modifications that are consistently found sample-wide (for example, in vitro modifications from sample handling and processing). However, Preview may miss post-translational modifications that are found only on a few proteins, and Preview does not look for glycopeptides. If you are interested in finding specific post-translational modifications or glycopeptides, you should carefully review and adjust the Byonic parameters suggested by Preview.

The first output HTML page is a Summary that presents Preview’s most important findings: the top proteins (up to 10), mass measurement errors, and the most prevalent modifications. From this page, the user can click through to a Details page with more information, and to the results folder, which contains a recalibrated spectrum file (if that option was chosen) and spreadsheets giving peptide identifications. Note that there will be fewer identifications than can be obtained with standard search engines. Preview samples the data; it does not perform a complete search.

Top proteins

Result proteins

The top protein list shows that only Bruton’s tyrosine kinase was found in quantity, and indeed this sample is nominally a one-protein (therapeutic protein) sample, purified by gel electrophoresis. Preview ranks proteins using a rather complicated function of the number of unique peptides, the peptide matching scores, and so forth. In this case, only the top protein is surely in the sample, because it is the only protein with more than one distinct peptide. A few extra proteins do not hurt Preview’s assays, because they will provide few high-scoring identifications. Preview automatically adds matched decoy peptides for all searches, and it uses the decoys to estimate and correct for false discoveries and to set the score thresholds for accepting identifications. There is no need to add decoys to the input database.

Recalibration

Recalibrate plot

The plots of precursor and fragment measurement errors reveal something interesting in this example: the measurements could benefit from recalibration. The precursor measurements are running about 15 ppm too high, which is fairly large for Orbitrap measurements, and the fragment measurements are running about 0.3 Da too low, which is fairly large for LTQ measurements. Preview includes built-in recalibration, which generates a new spectrum file (in the specified output folder). On this data set, recalibration improves the precursor errors by about 7x to a median accuracy (absolute value) of about 2 ppm and the fragment errors by about 3x to a median accuracy of about 0.1 Da.

Note that a median precursor accuracy of 2.4 ppm does not mean that the user should specify 2.4 ppm tolerance in a search engine such as Mascot, SEQUEST, or X!Tandem. The median error is the typical error of an abundant ion, and the mass tolerance should be set to at least three times the typical error in order to catch all the valid identifications. In the case of BTK.recal.dta, we would choose a 10 ppm precursor tolerance and 0.4 Da fragment tolerance, about 4x these median errors.
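
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name suggest_tolerance and the default multiplier of 4 are assumptions taken from this example, not a Preview feature or setting.

```python
def suggest_tolerance(median_error, multiplier=4.0):
    """Scale a median (typical) error into a search tolerance.

    The multiplier is an illustrative assumption: the text recommends at
    least 3x the typical error and uses about 4x for this data set.
    """
    if multiplier < 3.0:
        raise ValueError("tolerance should be at least 3x the median error")
    return median_error * multiplier


# Median errors reported after recalibration in this example.
precursor_tol_ppm = suggest_tolerance(2.4)  # about 9.6 ppm; the text rounds up to 10 ppm
fragment_tol_da = suggest_tolerance(0.1)    # about 0.4 Da
print(precursor_tol_ppm, fragment_tol_da)
```

In practice the result would be rounded up to a convenient value, as done in the text.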

Summary: Digestion and modifications

Digestions and modifications

The Summary page also gives statistics on digestion specificity and fixed and variable modifications. In this case, we see nonspecific digestion at the N-terminus, oxidized methionine, cysteine propionamide (not surprising in a gel sample), acetylated protein N-terminus, and phosphorylated serine, threonine, and tyrosine (known PTMs in Bruton’s tyrosine kinase).

The list of the most common variable modifications (above image, bottom) helps the user choose the modifications to enable for a full search, based on prevalence, biological importance, search time, and so forth. The most common variable modifications in this sample are oxidized methionine, protein (and hence peptide) N-terminal acetylation, and cysteine propionamide; the user would probably want to enable all three of these. The user may want to enable sodiation; enabling this modification does not usually increase the amount of biological information, but a search with sodiation at D, E, and C-terminus would be reasonable. The user would definitely want to enable phosphorylation in studying a kinase that is itself phosphorylated or any sample where phosphorylation is suspected to be of importance. The user may or may not want to enable deamidation and N-terminal methylation and dimethylation. Some samples have considerably more deamidation than in this example.

Detailed result

Details

The Details page lists all of Preview’s assays, including both positive and negative results. It also lists the size of the sample for each assay. For example, Preview made 73 identifications (including duplicate identifications) to peptides containing M, and 24 of these 73 included at least one M [+16]. The sample sizes let the user judge the statistical significance of the assay result.
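
To make the role of the sample size concrete, the following Python sketch computes the modification rate from the counts quoted above, together with a simple binomial standard error. The standard error is an assumption added for illustration; the source does not say how, or whether, Preview quantifies uncertainty.

```python
import math


def modification_rate(modified, eligible):
    """Percentage of eligible identifications that carry the modification."""
    if eligible <= 0:
        raise ValueError("no eligible identifications")
    return 100.0 * modified / eligible


def standard_error(modified, eligible):
    """Binomial standard error of the rate, in percentage points (illustrative)."""
    p = modified / eligible
    return 100.0 * math.sqrt(p * (1.0 - p) / eligible)


# Counts from the example: 24 of 73 M-containing identifications carry M[+16].
rate = modification_rate(24, 73)
error = standard_error(24, 73)
print(f"{rate:.1f}% +/- {error:.1f} percentage points")
```

With only 73 eligible identifications the uncertainty is several percentage points, which is why the sample size matters when reading a rate.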

From the Summary page, the user can click to a Details page that gives the full account of Preview’s “assays” and their results. The above image shows that Preview detected no oxidations besides methionine sulfoxide. It also reveals some amount of pyro-glu cyclization on peptides with N-terminal Q, E, and C[+57], and some beta-elimination, which in this case is probably caused by neutral loss of phosphoric acid from phosphoserine and phosphothreonine. Pyro-glu does not increase the size of the search much, because it applies only to peptides with certain N-terminal residues, and beta-elimination provides additional information about phosphorylation sites, so in this study we would enable these modifications, along with oxidized methionine, protein N-terminal acetylation, cysteine propionamide, and all three phosphorylations (S, T, Y).

Note: The Details page reports the rates of modification on eligible peptides, whereas the Summary page reports potential gains in the total number of identifications. Denominators in the percentages may also vary from search to search due to “second-order” effects such as multiply modified peptides and corrections for hits to decoys.

This page describes each item in the user interface of Preview. To learn more, visit the About, Tutorial, and FAQ pages. Also note the helpful information in the form of tool tips: simply hover the mouse cursor over the text on the application’s user interface to bring up these tool tips.

Input Files

These are the two required fields.

Select MS/MS data file

Input spectra in mzML, mzXML, Thermo RAW, or .mgf format. See the Examples_to_test_with folder in the Preview program folder for examples.

Select Protein database

Select a protein database in FASTA format. See the sample_data folder in the Preview program folder for examples.

Fixed modification options

You may provide fixed modifications. The pulldown menu for each edit box provides common modifications.  The user may also enter any modification not on the pulldown menu by typing the exact mass in the empty box.

Cysteine modification mass

Enter the mass, in Daltons, of a fixed modification on cysteine. In default mode (unselected), Preview will test for common modifications (0, 46, 57, 58, and 71).

Lysine, Arginine, N-terminal, and C-terminal modification masses

Select a fixed modification from the pulldown menu, or type in an exact mass in Daltons. In default mode, Preview will assume 0 Da and will not test for other possibilities.

Search Options

Digestion cleavage site(s) and side

Enter one-letter abbreviations for residues at cleavage points. Default is RK, meaning trypsin, with the cleavage side set to C-terminal side of the cleavage point.
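
As an illustration of what a cleavage-site setting such as RK with C-terminal cleavage means, here is a minimal Python sketch of an in silico digest. The function digest is hypothetical and deliberately simplified: it models no missed cleavages and no proline rule, and it is not Preview's implementation.

```python
def digest(sequence, cleavage_residues="RK", side="C"):
    """Split a protein sequence at the given cleavage residues.

    side="C" cuts after each cleavage residue (as for trypsin);
    side="N" cuts before it.
    """
    if side not in ("C", "N"):
        raise ValueError("side must be 'C' or 'N'")
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in cleavage_residues:
            cut = i + 1 if side == "C" else i
            if cut > start:
                peptides.append(sequence[start:cut])
                start = cut
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing piece up to the protein C-terminus
    return peptides


# Made-up sequence for illustration.
print(digest("MKTAYRGLK"))            # ['MK', 'TAYR', 'GLK']
print(digest("MKTAYRGLK", side="N"))  # ['M', 'KTAY', 'RGL', 'K']
```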

Initial search specificity

Preview chooses representative proteins based on the initial search. Subsequent searches find modifications and non-specific digestions. Fully specific search (the default) is recommended for digested samples; it means that both peptide termini must agree with the input digestion cleavages. N-ragged search means that the C-terminus must agree (the N-terminus may disagree), and C-ragged means that the N-terminus must agree (the C-terminus may disagree). Semispecific means one terminus (either one) can disagree. Nonspecific means that both termini can disagree.
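
The five specificity settings can be read as rules on whether each peptide terminus agrees with the cleavage sites. The Python sketch below encodes that reading. The function names are hypothetical, C-terminal-side cleavage is assumed, and treating a protein terminus as an agreeing terminus is also an assumption.

```python
def termini_agree(protein, start, end, cleavage_residues="RK"):
    """Return (n_ok, c_ok) for the peptide protein[start:end].

    Assumes C-terminal-side cleavage: a terminus agrees if it directly
    follows a cleavage residue or coincides with a protein terminus.
    """
    n_ok = start == 0 or protein[start - 1] in cleavage_residues
    c_ok = end == len(protein) or protein[end - 1] in cleavage_residues
    return n_ok, c_ok


def allowed(specificity, n_ok, c_ok):
    """Apply the rule for one specificity setting."""
    rules = {
        "fully specific": n_ok and c_ok,
        "N-ragged": c_ok,             # N-terminus may disagree
        "C-ragged": n_ok,             # C-terminus may disagree
        "semispecific": n_ok or c_ok,
        "nonspecific": True,
    }
    return rules[specificity]


protein = "MKTAYRGLK"                      # made-up sequence for illustration
n_ok, c_ok = termini_agree(protein, 3, 6)  # peptide "AYR": ragged N-terminus
for name in ("fully specific", "N-ragged", "C-ragged", "semispecific", "nonspecific"):
    print(name, allowed(name, n_ok, c_ok))
```

For the peptide "AYR" in this example, only the N-terminus disagrees, so it is accepted under N-ragged, semispecific, and nonspecific settings but rejected under fully specific and C-ragged settings.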

Phospho enriched

Check this box only if the sample is highly enriched in phosphopeptides, that is, the predominant composition is that of phosphopeptides.

Wildcard search

This option enables searches of the spectra with a wild-card modification, which allows any integer mass change (in the range of -50 to +120) to any one residue. The mass of the modification will be reported to the accuracy of the precursor mass.

Try all charge assignments

By default, Preview considers only the precursor charge assignments in the .dta or .mgf input, but selecting this option will force Preview to consider z = +1, +2, +3 for each CID spectrum, and z = +2, +3, +4 for each ETD spectrum. This may give better results for lower resolution ion-trap precursor scans.
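
To see why the charge assignment matters, note that each charge hypothesis implies a different neutral precursor mass. The Python sketch below lists the hypotheses described above for one precursor m/z. The function name is hypothetical and the code only illustrates the arithmetic; it is not how Preview is implemented.

```python
PROTON_MASS = 1.007276  # Da

# Charge states tried per fragmentation type when "Try all charge assignments" is on.
CHARGES_TO_TRY = {"CID": (1, 2, 3), "ETD": (2, 3, 4)}


def candidate_masses(precursor_mz, fragmentation, assigned_charge, try_all=False):
    """Map each considered charge z to the neutral mass it implies."""
    if try_all:
        charges = CHARGES_TO_TRY[fragmentation]
    else:
        charges = (assigned_charge,)  # default: trust the charge in the input file
    return {z: (precursor_mz - PROTON_MASS) * z for z in charges}


# Made-up precursor at m/z 500.0 with an assigned charge of 2.
print(candidate_masses(500.0, "CID", 2))
print(candidate_masses(500.0, "CID", 2, try_all=True))
```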

Fragmentation type

Choose between CID / HCD (b and y ions) or ETD / ECD (c and z ions).

Recalibrate masses

Preview optionally creates a new MS/MS data file (.mgf format) in the results folder with recalibrated (1) precursors only, (2) fragments only, or (3) both precursors and fragments (default). Alternatively, Preview can be set to make no recalibrated data file. The original file is not altered.

Additional Items of Note

Preferences

The Preferences dialog is located under the menu Edit > Preferences. Use this to set the working folder for results. Resulting files such as recalibrated spectra and HTML results will be saved within the selected folder. Note that resulting files are never deleted automatically; to purge them, remove them manually.

Parameters load/save

For convenience, you can save and load your parameters, such as the modifications and digests. Note that a parameter file (params.prv) will be created for each run, located within the working folder.

Is Preview a search engine like Byonic, Mascot, or SEQUEST?
Not exactly. Preview samples the data, so it generally makes fewer identifications than a full search engine. On the other hand, it tests many more modifications and search options than any full search engine.

How should I use Preview?
Run Preview before you run any other searches, so that you will know what type of full search will be most effective.

Should I use Preview to recalibrate my m/z measurements?
Yes! If Preview makes sufficiently many identifications, say at least 20 precursors and at least 50 fragments, then you will generally get better results out of a full search with Preview’s recalibrated spectrum file than with the original spectrum file unless the original calibration is extremely good. If you have enough identifications to avoid over-fitting (say 100 or more precursors), you can even run Preview’s recalibrated spectrum file through Preview again for even more precise recalibration.

How does Preview recalibrate m/z measurements?
It maps measured m/z values to recalibrated m/z values using quadratic curves (the red curves shown in the plots). We have found that calibration does not drift much over the course of an LC-ESI run, so that the same quadratic curve works for all spectra. Calibration can change from plate to plate with MALDI, however, so it’s quite possible to see a lot of scatter in the m/z errors from a data set comprising many MALDI plates.
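
The general idea of a quadratic recalibration can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Preview's algorithm: it assumes the curve models the error in ppm as a function of m/z (expressed in units of 1000 for numerical stability), it fits by ordinary least squares, and the data points are invented.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns [a, b, c]."""
    # Normal equations (X^T X) coef = X^T y as an augmented 3x4 matrix.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [a - factor * b for a, b in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def recalibrate(mz, coef):
    """Remove the fitted ppm error at this m/z from the measured value."""
    x = mz / 1000.0
    error_ppm = coef[0] + coef[1] * x + coef[2] * x * x
    return mz / (1.0 + error_ppm * 1e-6)


# Invented (m/z, error in ppm) pairs standing in for confident identifications.
points = [(400.0, 14.2), (600.0, 15.1), (800.0, 15.6), (1000.0, 15.8), (1200.0, 15.5)]
coef = fit_quadratic([mz / 1000.0 for mz, _ in points], [err for _, err in points])
print(recalibrate(800.0, coef))
```

Because one curve is fitted to all identifications, this approach relies on the calibration being stable across the run, which is the point made above about LC-ESI versus multi-plate MALDI data.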

If Preview reports median precursor error of 2 ppm, should I set the precursor tolerance in the full search to 2 ppm?
No! The median error is the typical error for an abundant ion, and it is at least 3 to 5 times smaller than the maximum error. Also check the number of “off-by-one” errors reported by Preview: even on high-accuracy instruments, many precursor masses may reflect the mass of the first isotope peak rather than the monoisotopic mass.
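
An “off-by-one” precursor can be recognized because its mass is too high by roughly one 13C/12C mass difference (about 1.003355 Da). The Python sketch below shows one way to express that check. The function names and the classification logic are assumptions for illustration; the source does not describe how Preview counts these errors.

```python
C13_C12_DIFF = 1.003355  # Da, mass difference between 13C and 12C


def error_ppm(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def classify_precursor(observed_mass, theoretical_mass, tolerance_ppm):
    """Return 'match', 'off-by-one', or 'no match'."""
    if abs(error_ppm(observed_mass, theoretical_mass)) <= tolerance_ppm:
        return "match"
    # Retry assuming the first isotope peak was picked instead of the monoisotopic peak.
    shifted = observed_mass - C13_C12_DIFF
    if abs(error_ppm(shifted, theoretical_mass)) <= tolerance_ppm:
        return "off-by-one"
    return "no match"


# Made-up masses for illustration, with a 10 ppm tolerance.
print(classify_precursor(1500.0030, 1500.0, 10.0))  # match
print(classify_precursor(1501.0064, 1500.0, 10.0))  # off-by-one
print(classify_precursor(1500.5000, 1500.0, 10.0))  # no match
```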

In the full search, should I enable all the modifications that Preview reports as “Common variable modifications”?
Not necessarily. Some full search engines do not support all the modifications supported by Preview. Some modifications are biologically uninteresting (for example, sodiation) and should only be enabled if they would contribute a significant number of additional identifications.

How does Preview compute False Discovery Rate (FDR)?
Preview uses the target / decoy approach to FDR estimation, and estimates the number of true identifications by the number of target identifications minus the number of decoy identifications. There is no need to add decoy proteins to the protein database, because Preview does this automatically. Preview does not report FDR, but it uses FDR internally to decide which identifications to accept.
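
The target / decoy arithmetic can be sketched in Python as follows. The first function follows the description above (targets minus decoys). The FDR ratio of decoys to targets and the threshold search are common conventions assumed here for illustration, the 1% default is arbitrary, and none of this is claimed to be Preview's internal procedure.

```python
def estimate_true_identifications(targets, decoys):
    """Estimated number of true identifications: targets minus decoys."""
    return max(targets - decoys, 0)


def estimate_fdr(targets, decoys):
    """Estimated false discovery rate among target hits (assumed convention)."""
    if targets == 0:
        return 0.0
    return min(decoys / targets, 1.0)


def score_threshold(hits, max_fdr=0.01):
    """Lowest score cutoff whose accepted hits keep the estimated FDR <= max_fdr.

    hits is a list of (score, is_decoy) pairs; returns None if no cutoff works.
    """
    best = None
    targets = decoys = 0
    for score, is_decoy in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


# Made-up hits for illustration.
hits = [(90, False), (80, False), (70, False), (60, True), (50, False)]
print(estimate_true_identifications(4, 1))  # 3
print(score_threshold(hits, max_fdr=0.2))   # 70
```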

How reliable are Preview’s statistics?
Preview’s statistics are especially good for “normal” shotgun proteomics, meaning digested multi-protein samples. Preview loses some reliability on very highly modified samples, in which many peptides carry more than one variable modification.

How can I use Preview to improve my sample processing?
Preview reports on the amount of nonspecific digestion, m/z measurement errors, and sample preparation artifacts such as over- and under-alkylation, carbamylation, oxidation, sodiation, and deamidation. This type of information can provide valuable feedback.

How should I read Preview’s peptide and protein identifications?
This list (accessible from the Details page) gives the highest-scoring identification for each spectrum, so long as the score is high enough to be statistically significant. We don’t usually do much with this list of identifications: remember that Preview samples the data, and does NOT perform a full search.

How should I read Preview’s wildcard search results?
Preview’s wildcard is just what it sounds like: any mass shift on any one residue. Wildcard identifications are often approximate, with misplaced modifications, two modifications combined into one wildcard, two known modifications in a combination not considered by Preview’s other searches, and so forth. On the other hand, these identifications, especially if they have scores over 60, are rarely completely wrong. A wildcard search will find polymorphisms, unanticipated modifications, and mystery mass shifts in almost any sample.

Why do the Summary and Details statistics sometimes disagree?
The Summary page reports the overall gain to be achieved by enabling the modification, for example, 8.5% more identifications by allowing oxidized methionine for the BTK sample data. In contrast, the Details page reports the rate of modification, for example, 32.9% of peptides containing methionine contain at least one oxidized methionine. In other words, the Summary reports the “bottom line”, how many more identifications can be obtained by enabling the modification, while the Details page reports direct comparisons on specific, limited searches.

For example, to assess the rate of oxidized methionine, Preview searches the spectra only against methionine-containing peptides, and reports the results of the search on the Details page. Then after all searches have been done, Preview compiles the summary statistics by counting up all the identifications for all spectra.

Denominators in the percentages may also vary from search to search due to “second-order” effects such as multiply modified peptides and corrections for hits to decoys.

Preview’s statistics can lose accuracy on extreme data sets, those in which a large percentage (say 30% or more) of the peptides carry more than one type of modification: for example, a data set that is both highly over-alkylated and highly oxidized.

Where do I learn more about Preview?
Check out Preview’s Tutorial and Help webpages. For background on why Preview was created and what it does, see the About page.