Help with code

This is the place for queries that don't fit in any of the other categories.

Help with code

Postby betty3 » Wed May 22, 2013 9:53 am

Hi,
I have following file which my code reads (test_file2.txt) and my python code.
Code: Select all
#!/usr/bin/python

#*********WORKING DIRECTORY***********************************************************************************************************
workDir1 = "/home/ela/Project/Pfam"

#***************INPUT FILES***********************************************************************************************************
file1 = "%s/draft/output/test_file2.txt" % (workDir1)
file2 = "%s/draft/output/output/not_overlapping_with_oneDomain.txt" % (workDir1)
file3 = "%s/draft/output/output/not_overlapping_eValue1Greater.txt" % (workDir1)
file4 = "%s/draft/output/output/not_overlapping_eValue2Greater.txt" % (workDir1)
#***************************************************************************************************************************************
repDom = dict()
fh1 = open (file1, "r")
fhout1 = open(file2, "w")
fhout2 = open(file3, "w")
fhout3 = open (file4, "w")
t = tuple()
set1=set()
set2= set()
#***************************************************************************************************************************************
def getValue(item):
   return int(item[1])


for line in fh1.readlines():
   
    line= line.split('\t')
    fgId = line[0]#first column in the file are FG id
    pfamAcc = line[2]#third column in the file are pfam accession number
    start = line[4]
    end = line[5]
    eValue = line[8].replace('\n','')
    t = (pfamAcc,start, end, eValue)
    repDom.setdefault(fgId, list())
    repDom[fgId].append(t)#filling the value(set)with particular accession number for each FG Id

for key, value in repDom.items():
   if len(value) > 1:
      domainSorted= sorted(value, key = getValue)
      value = domainSorted   
      n = len(domainSorted)
      for i in xrange(n-1):
           pfamAcc,istart, iend, eValue = domainSorted[i]
            for j in xrange(i+1,n):
                 pfamAcc,jstart, jend, eValue = domainSorted[j]
                 if int(jstart) < int(iend):
                         a = int(iend) - int(jstart)
                         b = len(xrange(int(istart), int(iend)))
                         c = (a *100)/b
                         if c > 10:
                            print "overlap:" ,key, i,domainSorted[i],j, domainSorted[j], c
                 
                            if (float(domainSorted[i][3]) - float(domainSorted[j][3])) > 5:
            #set1.add(domainSorted[i])
                                print domainSorted[i][3]
                
                     #print "good Domain1:",i, domainSorted[i]
                               #print "good domain1", key, domainSorted[i][0]
                #fhout2.write("%s\t%s\n" %(key,domainSorted[i][0]))
                             
                            elif (float(domainSorted[j][3]) - float(domainSorted[i][3])) > 5:
                                #set2.add(domainSorted[j])

                               
                               
                     print "good Domain2:", j, domainSorted[j]
                #print "good domain2", key, domainSorted[j][0]
                #fhout3.write("%s\t%s\n" %(key,domainSorted[j][0]))
                  #else:
                    # print "still overlap:",i, domainSorted[i], j, domainSorted[j], c
                   
                    else:
          
                  print " not overlap:", i, value[i], j, value[j]
                     
                         #print key, set1.difference(set2)
  # else:
   #print key, value
        #fhout1.write("%s\t%s\n" %(key,value))           

In this code I am looking for overlapping domains. They overlap if more than 10% of their length overlaps. When they overlap I compare evalue for each of the overlapping pair. When the difference is greater than 5 I would like to leave domain with the higher evalue and the other discard. This is where I stack with my code for protein with more than 2 domains.

For example for protein FGSG_00032 I should be left with domain 2 which has greater evalue than domain 3 and domain 1which are neighbours of domain 2. Domain 0 was excluded while comparing to domain 1 as has smaller evalue.
How can I address this problem in my code? I would like to have as an output the proteins with non-overlapping domains base on the condition above.
Please help to correct my code.
Many thanks
Attachments
test_file2.txt
The file that my python code reads
(3.64 KiB) Downloaded 18 times
Last edited by stranac on Wed May 22, 2013 10:16 am, edited 1 time in total.
Reason: Included the code in post, instead of an attachment
betty3
 
Posts: 3
Joined: Tue May 07, 2013 9:19 pm

Re: Help with code

Postby MichelFJM » Wed May 22, 2013 1:47 pm

Hello

The problem is probably that you do not treat the datas by pairs in your code although you say you shoud.
You should reduce your file test_file2.txt to the lines with FGSG_00032 and add prints in the code to find the error.
len(xrange(int(istart), int(iend))) should be changed by int(iend)-int(istart)
Good luck.
MichelFJM
 
Posts: 19
Joined: Wed May 22, 2013 1:41 pm

Re: Help with code

Postby metulburr » Wed May 22, 2013 1:55 pm

In this code I am looking for overlapping domains. They overlap if more than 10% of their length overlaps. When they overlap I compare evalue for each of the overlapping pair. When the difference is greater than 5 I would like to leave domain with the higher evalue and the other discard. This is where I stack with my code for protein with more than 2 domains.

For example for protein FGSG_00032 I should be left with domain 2 which has greater evalue than domain 3 and domain 1which are neighbours of domain 2. Domain 0 was excluded while comparing to domain 1 as has smaller evalue.


After reading this i still have no idea what you are trying to do. Actually i feel a little dumber now with a slight headache. Something like "I am getting X as output and I want Y as the output", would suffice.

Also replacing the file with a string of a small portion of the file, would be recommended. Currently your code as is, is too much effort on my part to get working

and so people dont have to download the text file to view it:
Code: Select all
FGSG_00006      PF11327.3   Protein of unknown function (DUF3129)   16   270   208.82   1.90E-59   58.7212464
FGSG_00011      PF00106.20   short chain dehydrogenase   5   173   36.69   1.20E-07   6.920818754
FGSG_00012      PF00067.17   Cytochrome P450   33   517   186.81   8.00E-53   52.09691001
FGSG_00013      PF00083.19   Sugar (and other) transporter   59   520   369.67   7.10E-108   107.1487417
FGSG_00013      PF07690.11   Major Facilitator Superfamily   63   473   66.58   1.20E-16   15.92081875
FGSG_00014      PF13641.1   Glycosyltransferase like family 2   21   255   59.53   1.60E-14   13.79588002
FGSG_00014      PF00535.21   Glycosyl transferase family 2   24   191   46.64   1.20E-10   9.920818754
FGSG_00015      PF00083.19   Sugar (and other) transporter   44   497   254.62   3.10E-73   72.50863831
FGSG_00015      PF07690.11   Major Facilitator Superfamily   49   449   59.5   1.70E-14   13.76955108
FGSG_00017      PF01546.23   Peptidase family M20/M25/M40   152   558   70.83   6.50E-18   17.18708664
FGSG_00017      PF07687.9   Peptidase dimerisation domain   272   434   50.16   1.10E-11   10.95860731
FGSG_00018      PF00487.19   Fatty acid desaturase   44   274   76.03   1.80E-19   18.74472749
FGSG_00019      PF00071.17   Ras family   359   529   42.88   1.70E-09   8.769551079
FGSG_00019      PF08477.8   Miro-like protein   359   480   23.7   0.001003   2.998699067
FGSG_00023      PF00190.17   Cupin   78   212   35.71   2.40E-07   6.619788758
FGSG_00024      PF00144.19   Beta-lactamase   15   373   224.94   2.60E-64   63.58502665
FGSG_00026      PF07690.11   Major Facilitator Superfamily   59   397   69.87   1.30E-17   16.88605665
FGSG_00028      PF05572.8   Pregnancy-associated plasma protein-A   113   271   58.57   3.20E-14   13.49485002
FGSG_00032      PF12697.2   Alpha/beta hydrolase family   27   271   126.83   9.00E-35   34.04575749
FGSG_00032      PF00561.15   alpha/beta hydrolase fold   52   274   73.21   1.30E-18   17.88605665
FGSG_00032      PF12695.2   Alpha/beta hydrolase family   26   259   58.43   3.50E-14   13.45593196
FGSG_00032      PF12146.3   Putative lysophospholipase   11   89   35.71   2.40E-07   6.619788758
FGSG_00034      PF00083.19   Sugar (and other) transporter   54   509   222.71   1.20E-63   62.92081875
FGSG_00036      PF13561.1   Enoyl-(Acyl carrier protein) reductase   552   801   57.51   6.70E-14   13.1739252
FGSG_00036      PF00109.21   "Beta-ketoacyl synthase, N-terminal domain"   1014   1263   56.41   1.40E-13   12.85387196
FGSG_00036      PF02801.17   "Beta-ketoacyl synthase, C-terminal domain"   1342   1479   53.52   1.10E-12   11.95860731
FGSG_00036      PF00106.20   short chain dehydrogenase   546   727   32.54   0.000002   5.698970004
FGSG_00040      PF13414.1   TPR repeat   185   255   42.82   1.80E-09   8.744727495
FGSG_00040      PF00856.23   SET domain   347   531   36.04   1.90E-07   6.721246399
FGSG_00040      PF13174.1   Tetratricopeptide repeat   259   291   32   0.000003   5.522878745
FGSG_00040      PF07719.12   Tetratricopeptide repeat   258   291   28.37   0.000039   4.408935393
FGSG_00043      PF07992.9   Pyridine nucleotide-disulphide oxidoreductase   6   320   66.04   1.80E-16   15.74472749
FGSG_00044      PF12697.2   Alpha/beta hydrolase family   44   313   55.21   3.30E-13   12.48148606
FGSG_00045      PF12697.2   Alpha/beta hydrolase family   47   264   96.35   1.40E-25   24.85387196
FGSG_00045      PF12695.2   Alpha/beta hydrolase family   46   252   45.59   2.60E-10   9.585026652
FGSG_00046      PF00005.22   ABC transporter   634   746   119.55   1.40E-32   31.85387196
FGSG_00046      PF00005.22   ABC transporter   1251   1372   119.55   1.40E-32   31.85387196
FGSG_00046      PF00664.18   ABC transporter transmembrane region   256   529   97.98   4.40E-26   25.35654732
FGSG_00046      PF00664.18   ABC transporter transmembrane region   868   1164   97.98   4.40E-26   25.35654732
FGSG_00048      PF14226.1   non-haem dioxygenase in morphine synthesis N-terminal   8   129   93.47   1.00E-24   24
FGSG_00048      PF03171.15   2OG-Fe(II) oxygenase superfamily   176   291   52.61   2.00E-12   11.69897
FGSG_00049      PF01063.14   Aminotransferase class IV   72   324   74.84   4.00E-19   18.39794001
FGSG_00050      PF07859.8   alpha/beta hydrolase fold   88   291   38.93   2.60E-08   7.585026652
FGSG_00050      PF12695.2   Alpha/beta hydrolase family   87   288   29.72   0.000015   4.823908741
New Users, Read This
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
steam
User avatar
metulburr
 
Posts: 1381
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY


Return to General Coding Help

Who is online

Users browsing this forum: cairo, casevh, D-termined, mr ham and 1 guest