This post illustrates how to install and run InterProScan on a local machine. InterProScan is frequently used in bioinformatics data analysis to extract various protein features based on sequence alignment or database search. As scanning a protein sequence is time consuming even on a local machine, we also install the lookup database, from which the features of known proteins can be extracted directly.
Some useful instructions for InterProScan can be found in the Google Code documentation.
Make a directory for the software package
mkdir myinterproscan; cd myinterproscan
Download the software package and MD5 checksum with the following command
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.13-52.0/interproscan-5.13-52.0-64-bit.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.13-52.0/interproscan-5.13-52.0-64-bit.tar.gz.md5
Check the MD5 checksum to make sure the download was successful with the following command
md5sum -c interproscan-5.13-52.0-64-bit.tar.gz.md5
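The same check is repeated for every archive in this post, so it can be wrapped in a small helper. The following is a minimal sketch: `verify_md5` is a hypothetical name, and the demonstration uses a throwaway file instead of a real download. It assumes the `.md5` file uses the usual `md5sum` layout (hash followed by file name), which is what `md5sum -c` expects.

```shell
# Sketch: verify a download against its .md5 file before unpacking.
# verify_md5 is a hypothetical helper name, not part of InterProScan.
verify_md5() {
    # $1 is the .md5 file. md5sum -c resolves the listed file name relative
    # to the current directory, so run it from the directory of the .md5 file.
    ( cd "$(dirname "$1")" && md5sum -c "$(basename "$1")" >/dev/null 2>&1 )
}

# Demonstration on a throwaway file, so no download is needed.
demo_dir=$(mktemp -d)
printf 'dummy archive contents\n' > "$demo_dir/demo.tar.gz"
( cd "$demo_dir" && md5sum demo.tar.gz > demo.tar.gz.md5 )

if verify_md5 "$demo_dir/demo.tar.gz.md5"; then
    intact_result="OK"
else
    intact_result="FAILED"
fi

# Append bytes to the file to show that a damaged download is detected.
printf 'extra bytes\n' >> "$demo_dir/demo.tar.gz"
if verify_md5 "$demo_dir/demo.tar.gz.md5"; then
    corrupt_result="OK"
else
    corrupt_result="FAILED"
fi

echo "intact: $intact_result, corrupted: $corrupt_result"
rm -rf "$demo_dir"
```

With a real download, the call would be `verify_md5 interproscan-5.13-52.0-64-bit.tar.gz.md5`, followed by the unpacking step only if it succeeds.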
Unpack the InterProScan package with the following command
tar -pxvzf interproscan-5.13-52.0-*-bit.tar.gz
The Panther model data goes in the data directory under the InterProScan directory. Download the Panther model with the following commands
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-9.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-9.0.tar.gz.md5
Check that the download was successful with the following command
md5sum -c panther-data-9.0.tar.gz.md5
Unpack Panther model data with the following command
tar -pxvzf panther-data-9.0.tar.gz
Download the .tar.gz file phobius101_linux.tar.gz. Unpack the file with the following command
tar -xzvf phobius101_linux.tar.gz
Move all files to the Phobius folder under the InterProScan home directory with the following command
mv phobius/* interproscan-5.13-52.0/bin/phobius/1.01/
Then change the corresponding line in the InterProScan configuration file ‘interproscan.properties’ to the following
binary.phobius.pl.path.1.01=bin/phobius/1.01/phobius.pl
Then you are done!
Download the .tar.gz package and unpack it with the following command
tar -xzvf tmhmm-2.0c.Linux.tar.gz
Move all files to the TMHMM directory under the InterProScan directory with the following command
mv tmhmm-2.0c/* interproscan-5.13-52.0/bin/tmhmm/2.0/
Then change the corresponding lines in the InterProScan configuration file ‘interproscan.properties’ to the following
binary.tmhmm.path=bin/tmhmm/2.0/bin/decodeanhmm.Linux_x86_64
tmhmm.model.path=bin/tmhmm/2.0/lib/TMHMM2.0.model
Then you are done!
Finally, change the corresponding line in the InterProScan configuration file ‘interproscan.properties’ to the following
binary.signalp.4.0.path=bin/signalp/4.1/signalp
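The edits to ‘interproscan.properties’ described above can also be scripted instead of made by hand. The following is a minimal sketch under some assumptions: `set_property` is a hypothetical helper, the demonstration works on a temporary file with made-up starting values, and values are assumed not to contain `|` or `&`, which would need escaping for `sed`.

```shell
# set_property FILE KEY VALUE
# Replace the line starting with "KEY=" in FILE, or append it if absent.
# Hypothetical helper, not part of InterProScan.
set_property() {
    file=$1
    key=$2
    value=$3
    # Escape the dots in the key so they match literally in the regex.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${key_re}=" "$file"; then
        # Write to a temporary file and move it into place.
        sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstration on a temporary file; the starting values are placeholders.
props=$(mktemp)
cat > "$props" <<'EOF'
binary.phobius.pl.path.1.01=old/placeholder/path
binary.tmhmm.path=
EOF

set_property "$props" binary.phobius.pl.path.1.01 bin/phobius/1.01/phobius.pl
set_property "$props" binary.tmhmm.path bin/tmhmm/2.0/bin/decodeanhmm.Linux_x86_64
set_property "$props" tmhmm.model.path bin/tmhmm/2.0/lib/TMHMM2.0.model
set_property "$props" binary.signalp.4.0.path bin/signalp/4.1/signalp

props_result=$(cat "$props")
printf '%s\n' "$props_result"
rm -f "$props"
```

On a real installation the first argument would be the path to ‘interproscan.properties’; keeping a backup copy of that file before running such a script is a sensible precaution.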
The current version of InterProScan supports several databases. Their names and versions are listed in the following table
Software and databases | Version | Information |
---|---|---|
InterProScan | 5.13-52.0 | InterProScan package |
ProDom | 2006.1 | ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database. |
HAMAP | | High-quality Automated and Manual Annotation of Microbial Proteomes |
SMART | 6.2 | SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs |
SuperFamily | 1.75 | SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes. |
PRINTS | 42.0 | A fingerprint is a group of conserved motifs used to characterise a protein family |
Panther | 9.0 | The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. |
Gene3d | 3.5.0 | Structural assignment for whole genes and genomes using the CATH domain structure database |
PIRSF | 3.01 | The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships. |
PfamA | 27.0 | A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs) |
PrositeProfiles | | PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them |
TIGRFAM | 15.0 | TIGRFAMs are protein families based on Hidden Markov Models or HMMs |
PrositePatterns | | PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them |
Coils | 2.2 | Prediction of Coiled Coil Regions in Proteins |
TMHMM | 2.0 | Prediction of transmembrane helices in proteins |
Phobius | 1.01 | A combined transmembrane topology and signal peptide predictor |
SignalP_GRAM_NEGATIVE | 4.0 | SignalP (organism type gram-negative prokaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes |
SignalP_EUK | 4.0 | SignalP (organism type eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes. |
SignalP_GRAM_POSITIVE | 4.0 | SignalP (organism type gram-positive prokaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes |
Install lookup service
The lookup service makes it possible to extract the features of known proteins without scanning the sequences.
The lookup service can be installed in any directory. Create one with the following commands
mkdir i5_lookup_service
cd i5_lookup_service
Download data files for lookup services with the following commands
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/lookup_service_5.12-52.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/lookup_service_5.12-52.0.tar.gz.md5
Check the MD5 checksum to make sure the download was successful with the following command
md5sum -c lookup_service_5.12-52.0.tar.gz.md5
Unpack the downloaded package with the following command
tar -pxvzf lookup_service_5.12-52.0.tar.gz
This takes a very long time, so be patient.
In scanning mode
Some instructions for running InterProScan can be found in the Google Code documentation.
In its simplest form, the software can be invoked from the InterProScan home directory on the bundled test protein sequences with the following command
./interproscan.sh -i test_proteins.fasta
We can also take a look at the command line usage of InterProScan with the following command
./interproscan.sh
In lookup mode
The lookup mode can be activated with the option -iprlookup and, by default, it will query the service provided by the remote EBI server. The command for enabling lookup mode is as follows
./interproscan.sh -i ../../../Data/tcdb -iprlookup -f tsv
As we have downloaded the database for the lookup service, we can start the service on the local machine with the following command
java -Xmx2000m -jar server-5.12-52.0-jetty-console.war
This command will take a while. The output will indicate the host and the port of the service e.g. in my case
1166522 [main] INFO org.eclipse.jetty.server.AbstractConnector - Started @0.0.0.0:8080
Now we have to modify the interproscan.properties file to tell InterProScan to use the local lookup service. In particular, we modify the following line in the file
precalculated.match.lookup.service.url=http://0.0.0.0:8080
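The URL in that property comes straight from the host and port reported in the Jetty startup line. As a small sketch, the value can be derived from the log line with `sed`; the variable names here are illustrative only, and the log line is the one quoted in this post.

```shell
# The startup line as printed by the lookup service in this post.
log_line='1166522 [main] INFO org.eclipse.jetty.server.AbstractConnector - Started @0.0.0.0:8080'

# Keep everything after "Started @", i.e. host:port.
host_port=$(printf '%s\n' "$log_line" | sed -n 's/.*Started @\(.*\)$/\1/p')
lookup_url="http://$host_port"

# Print the property line to place in interproscan.properties.
echo "precalculated.match.lookup.service.url=$lookup_url"
```

If the service reports a different host or port on your machine, the same extraction yields the matching URL.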
To verify that InterProScan is using the lookup service, we use the command line option -goterms to make the software extract GO terms from the lookup database, as in the following command
./interproscan.sh -i ../../../Data/tcdb -iprlookup -f tsv --goterms
If there are GO annotations in the result file, the lookup service is running fine.
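Checking the result file by eye works, but the check can also be done with `grep`. The sketch below counts the lines of a TSV file that contain at least one GO identifier. `count_go_lines` is a hypothetical helper, and the two demo rows are made-up placeholder data, not real InterProScan output. The path of the real result file depends on your input and output options, so it is passed as an argument.

```shell
# count_go_lines FILE: print the number of lines containing a GO identifier.
# Hypothetical helper; "|| true" keeps the exit status 0 when the count is 0.
count_go_lines() {
    grep -c 'GO:[0-9]' "$1" || true
}

# Made-up tab-separated rows for demonstration only.
demo_tsv=$(mktemp)
printf 'protA\tmd5A\t120\tAnalysisX\tSIG0001\tplaceholder\t5\t80\t1.0E-5\tT\t01-01-2015\tIPR_PLACEHOLDER\tplaceholder\tGO:0000001\n' > "$demo_tsv"
printf 'protB\tmd5B\t95\tAnalysisY\tSIG0002\tplaceholder\t10\t60\t2.0E-3\tT\t01-01-2015\n' >> "$demo_tsv"
with_go=$(count_go_lines "$demo_tsv")

# A file without any GO identifiers, for contrast.
printf 'protC\tmd5C\t70\tAnalysisZ\tSIG0003\tplaceholder\t1\t30\t0.5\tT\t01-01-2015\n' > "$demo_tsv"
without_go=$(count_go_lines "$demo_tsv")

if [ "$with_go" -gt 0 ]; then
    echo "GO annotations found on $with_go line(s)"
fi
echo "lines with GO terms in the second file: $without_go"
rm -f "$demo_tsv"
```

A non-zero count on the real result file suggests that GO terms were returned; a count of zero means either the lookup service was not used or the input proteins have no GO annotations.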