Saturday, June 12, 2010

A simple program for FTP directory listing using pycurl

    import pycurl
    import StringIO

    # let's create a pycurl object
    c = pycurl.Curl()
    # let's specify the FTP directory to list (anonymous login)
    c.setopt(pycurl.URL, r'ftp://ftp.ncbi.nih.gov/refseq/release/')
    # let's create a buffer in which we will write the output
    output = StringIO.StringIO()
    # let's have pycurl write the received data into this buffer
    c.setopt(pycurl.WRITEFUNCTION, output.write)
    # let's perform the LIST operation
    c.perform()
    # let's get the output in a string and print it on screen
    result = output.getvalue()
    print result
    # FTP LIST output is separated by \r\n, so let's split it into lines
    lines = result.split('\r\n')
    # let's print the number of lines
    print len(lines)
    # let's walk through each line
    for line in lines:
        # split on whitespace, at most 8 times, so that a name
        # containing spaces stays in one piece
        parts = line.split(None, 8)
        print parts
        # skip blank lines and lines without the nine UNIX-style fields
        if len(parts) < 9: continue
        # the individual fields (parts[1] is the link count)
        permissions = parts[0]
        user = parts[2]
        group = parts[3]
        size = parts[4]
        month = parts[5]
        day = parts[6]
        yearortime = parts[7]
        name = parts[8]

The above program

  • Creates a pycurl object
  • Specifies the URL of an FTP server (anonymous account)
  • Creates a StringIO buffer to store the results of the FTP LIST command
  • Associates the pycurl object with the StringIO buffer, so that the output received from the FTP server is written into it
  • Performs the curl operation
  • Extracts the output
  • Breaks the output into lines (treating \r\n as the separator)
  • Walks through the lines one by one
  • Splits each line on whitespace into different parts
  • Extracts the different fields from the directory listing (permissions, user, group, size, filename, etc.)
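
The field-extraction step can also be wrapped in a small function that works without a network connection. This is only a sketch, written in Python 3 syntax: the names ListEntry and parse_unix_list_line are made up for illustration, the sample line is hand-written rather than copied from a real server, and the code assumes a UNIX-style listing with nine whitespace-separated fields.

```python
from collections import namedtuple

# Hypothetical container for the nine fields of a UNIX-style LIST line.
ListEntry = namedtuple(
    "ListEntry",
    "permissions links user group size month day year_or_time name",
)


def parse_unix_list_line(line):
    """Return a ListEntry for one UNIX-style LIST line, or None if it does not fit."""
    # Split at most 8 times so a file name containing spaces stays in one piece.
    parts = line.rstrip("\r\n").split(None, 8)
    if len(parts) < 9:
        return None
    permissions, links, user, group, size, month, day, year_or_time, name = parts
    # The link count and size must be numeric, otherwise this is another format.
    if not (links.isdigit() and size.isdigit()):
        return None
    return ListEntry(permissions, int(links), user, group, int(size),
                     month, day, year_or_time, name)


# Hand-written sample line (an assumption, not real server output).
sample = "dr-xr-xr-x   5 ftp      anonymous     4096 Jun 10  2010 release notes"
entry = parse_unix_list_line(sample)
print(entry.name, entry.size)
```

Returning None for lines that do not fit keeps the caller's loop simple: blank lines and unfamiliar formats are skipped instead of raising an IndexError.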

Notes about processing the output of the FTP LIST command

The response to the FTP LIST command is not standardized. Different FTP server implementations format the directory listing differently, so parsing the output of one particular server may be easy, but code that parses directory listings from all kinds of FTP servers is difficult to write. This is probably why this functionality is not provided in ftplib (Python Standard Library). In the FTP standard, the output of the LIST command was intended for human consumption rather than machine interpretation, which led to all the variations over the years.
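
To make the variation concrete, here is a small Python 3 sketch that guesses the style of a listing line before any field extraction is attempted. The two regular expressions, the function name guess_list_style and both sample lines are my own illustrative assumptions (a UNIX-style line and a DOS-style line as commonly seen), and they are far from covering every server.

```python
import re

# Assumed patterns: a UNIX "ls -l" style line starts with a type character and
# nine permission characters; a DOS-style line starts with a date and a time.
UNIX_RE = re.compile(r"^[-dlbcps][-rwxsStT]{9}\s")
DOS_RE = re.compile(r"^\d{2}-\d{2}-\d{2,4}\s+\d{2}:\d{2}(AM|PM)\s")


def guess_list_style(line):
    """Return 'unix', 'dos' or 'unknown' for one LIST output line."""
    if UNIX_RE.match(line):
        return "unix"
    if DOS_RE.match(line):
        return "dos"
    return "unknown"


# Hand-written sample lines (assumptions, not captured from real servers).
samples = [
    "-rw-r--r--   1 ftp      anonymous     1234 Jun 10  2010 README",
    "06-10-10  03:24PM       <DIR>          pub",
]
for line in samples:
    # The same whitespace split gives a different number of fields per style.
    print(guess_list_style(line), len(line.split()))
```

The point of the sketch is that a fixed index such as parts[8] is only meaningful once the style of the line is known.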

ftpparse (http://cr.yp.to/ftpparse.html) is a C library for parsing FTP LIST command responses from a variety of FTP servers. It currently understands the LIST output from any UNIX server, Microsoft FTP Service, Windows NT FTP Server, VMS, WFTPD, NetPresenz, NetWare, and MSDOS. It's easy to write a Python wrapper for this library using ctypes.
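
A ctypes wrapper could look roughly like the Python 3 sketch below. Several things here are assumptions: the structure layout follows my reading of ftpparse.h and should be checked against the header you actually compile, time_t is assumed to be a C long, the shared library name ./libftpparse.so is hypothetical (ftpparse is distributed as C source, so you have to build the shared library yourself), and the helper names load_ftpparse and parse_line are made up.

```python
import ctypes


class FtpParse(ctypes.Structure):
    """Assumed mirror of `struct ftpparse`; verify against ftpparse.h."""
    _fields_ = [
        ("name", ctypes.POINTER(ctypes.c_char)),  # not necessarily 0-terminated
        ("namelen", ctypes.c_int),
        ("flagtrycwd", ctypes.c_int),
        ("flagtryretr", ctypes.c_int),
        ("sizetype", ctypes.c_int),
        ("size", ctypes.c_long),
        ("mtimetype", ctypes.c_int),
        ("mtime", ctypes.c_long),                 # assumption: time_t is a long
        ("idtype", ctypes.c_int),
        ("id", ctypes.POINTER(ctypes.c_char)),    # not necessarily 0-terminated
        ("idlen", ctypes.c_int),
    ]


def load_ftpparse(path="./libftpparse.so"):
    """Load a self-built shared library (hypothetical path) and declare ftpparse()."""
    lib = ctypes.CDLL(path)
    lib.ftpparse.argtypes = [ctypes.POINTER(FtpParse), ctypes.c_char_p, ctypes.c_int]
    lib.ftpparse.restype = ctypes.c_int
    return lib


def parse_line(lib, line):
    """Parse one LIST line (bytes); return a dict, or None if no name was found."""
    buf = ctypes.create_string_buffer(line)  # must stay alive: name points into it
    fp = FtpParse()
    if not lib.ftpparse(ctypes.byref(fp), buf, len(line)):
        return None
    return {
        "name": ctypes.string_at(fp.name, fp.namelen),  # copy out namelen bytes
        "size": fp.size,
        "mtime": fp.mtime,
        "is_dir_candidate": bool(fp.flagtrycwd),
    }
```

With a library built this way, usage would be along the lines of lib = load_ftpparse() followed by parse_line(lib, b"...") for each line of the LIST output.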

Even this library doesn't work in a number of situations:
- When the size of a file is bigger than 2 GB.
- With the FTP servers of various video servers (I have seen GUI FTP clients like FileZilla and Windows Explorer choke on some of them).
