May 2, 2015

QNAP/Linux - Python Programming 05 (Learn by Doing: Scan and Parse Files: Part 1)

Many apps for NAS systems need to process file data in local file systems or on remote file servers. This "Learn by Doing" exercise tackles the following problem:
  • Scan files (local or remote)
  • Store the metadata of folders and files in a database
  • Extract metadata and text information from specific file types, such as PDF, Office, images, etc.
User manual of fscan.py
According to the user manual of fscan.py, the following code sets up the program framework.
The empty program framework of fscan.py
  • def main(argv): main program just runs parseArgs()
  • def parseArgs(): parse arguments
  • def printHelp(): print help message for '-h' arguments
  • if __name__ == "__main__": main(sys.argv[1:]): the entry point is main()
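A minimal sketch of such an empty framework could look like the following. The user manual is not reproduced in the text, so only the '-h' option is handled; the help message, passing argv into parseArgs(), and its return value are assumptions made for illustration.

```python
import getopt
import sys


def printHelp():
    # Placeholder help text; the real option list comes from the user manual.
    print("usage: fscan.py [-h]")


def parseArgs(argv):
    # Parse command-line arguments; only '-h' is recognized in this skeleton.
    try:
        opts, args = getopt.getopt(argv, "h")
    except getopt.GetoptError:
        printHelp()
        return None
    for opt, _ in opts:
        if opt == "-h":
            printHelp()
            return None
    # Remaining positional arguments are handed back to the caller.
    return args


def main(argv):
    # The main program just runs parseArgs().
    return parseArgs(argv)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Further options would be added to the option string of getopt.getopt() and handled in the loop.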

addRepository()

Parse URL/URI

First, we have to implement the function addRepository(), which checks that the added "host/folder" is valid and stores the information in the db. To extend the "host/folder" definition to a URI/URL, I searched Google for "python parse url"; the first result was "20.16. urlparse — Parse URLs into components". I changed the link to the Python 2.6.x documentation. The sample code there is very simple: just copy and paste it into the Python interpreter to test it.
urlparse: return (scheme, netloc, path, params, query, fragment)
According to the documentation, urlparse supports URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais. In this article, only local/remote file repositories are parsed and added.
  • file://localhost/path: e.g. "/share/" denotes the repository file://localhost/share/.
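As a quick check in the interpreter, the repository URI above can be split like this. The article targets Python 2.6, where the function lives in the urlparse module; the fallback import is only there so the same lines also run on Python 3.

```python
try:
    from urlparse import urlparse        # Python 2.x, as used in this article
except ImportError:
    from urllib.parse import urlparse    # Python 3.x location of the same function

parts = urlparse("file://localhost/share/")
print(parts.scheme)   # file
print(parts.netloc)   # localhost
print(parts.path)     # /share/
```

The result also behaves like the 6-tuple (scheme, netloc, path, params, query, fragment), so it can be unpacked directly.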

File and Directory Access

  • os.path.isdir(arg): Check if the local directory exists or not.
  • os.path.abspath(arg): Get the absolute path of local dir. Then map into: file://localhost/path.
addRepository() extracts and parses the path to insertDbResp() as a task
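The two calls above can be combined into a small helper. This is only a sketch: the name toFileUri is made up for illustration, and it assumes POSIX-style paths (as on a Linux NAS), where the absolute path already starts with "/".

```python
import os


def toFileUri(arg):
    # Return file://localhost/<absolute path>, or None if arg is not a local directory.
    if not os.path.isdir(arg):
        return None
    path = os.path.abspath(arg)          # e.g. "/share/" becomes "/share"
    return "file://localhost" + path
```

For example, toFileUri("/") gives "file://localhost/", and a path that does not exist gives None.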

Using psycopg2 with PostgreSQL

PostgreSQL supports trust connections (for security, restrict them to localhost only), so I created a database user "nas" and created tables, views, and functions in the database "fs". Then I tested the db connection and retrieved some data from the db.
sample code for db connection and db query
get the db record 
  • pg_hba.conf: "host    all     all    127.0.0.1/32   trust" or "host    all     all   localhost   trust"
  • import psycopg2: run "ipkg install py26-psycopg2" over ssh to install the package.
  • global conn: Create a global object for db connection to avoid open/close connections frequently.
  • conn = psycopg2.connect("host=localhost dbname=fs user=nas"): trust connection with user=nas.
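A minimal sketch of the global connection and a test query, assuming the trust setup above. The helper names getConn() and fetchAll() are hypothetical; psycopg2 is imported inside the function only so the module can be loaded on a machine without the package.

```python
DSN = "host=localhost dbname=fs user=nas"   # trust connection, so no password
conn = None                                 # global connection, opened once


def getConn():
    # Open the connection on first use and reuse it afterwards.
    global conn
    if conn is None:
        import psycopg2
        conn = psycopg2.connect(DSN)
    return conn


def fetchAll(sql):
    # Run a query and return all rows as a list of tuples.
    cur = getConn().cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()
```

With the database running, fetchAll("SELECT version()") is a simple way to confirm that the connection works.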
A repository corresponds to a record in the Class table, so the following metadata are stored in the db.
  • Name: e.g. share
  • Since: creation date/time, i.e. now().
  • LastModifyDT: last-modified date/time of the repository, used for checking validity.
  • Scheme: e.g. file=11; smb=12; ftp=13, etc. 
  • Host: localhost
  • Path: /path
  • URI: a derived attribute, i.e. file://localhost/path
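On the Python side, these attributes could be held in a small record before they are sent to the db. This is just one possible sketch: the Repository type and repositoryUri() are hypothetical names, and Since is left out because the db fills it in with now().

```python
from collections import namedtuple

# Scheme codes as listed above.
SCHEME_CODES = {"file": 11, "smb": 12, "ftp": 13}

Repository = namedtuple("Repository", "name lastModifyDT scheme host path")


def repositoryUri(repo):
    # URI is derived from the other attributes, so it is computed, not stored.
    return "%s://%s%s" % (repo.scheme, repo.host, repo.path)


repo = Repository("share", "Sat May  2 10:15:00 2015", "file", "localhost", "/share")
print(repositoryUri(repo))          # file://localhost/share
print(SCHEME_CODES[repo.scheme])    # 11
```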
Whatever the details of the Class table and its relationships with other tables, just create functions or views for your needs. In this case, I created a function InsertClass(parentCID, Name, LastModifyDT, Scheme, Host, Path) that adds a class record under "/Home/Repository" to build the Class hierarchy.

To get the metadata (last modified date) of a directory, I Googled "python directory time" and found "How to get file creation & modification date/times in Python?". There I learned a concise Python coding style:
How to get file creation & modification date/times in Python?
I borrowed the code to get the metadata of a directory and to pack the arguments of a SQL function. Here, the last modified date of the directory is retrieved into mtime.
use (a, b, .., ) to get return array
use (a, b, .., ) to pack arguments of function()
  • os.stat() retrieves the metadata of a directory or a file as a tuple, which is unpacked into the variables listed in (...).
  • time.ctime() converts a timestamp (seconds since the epoch) into a human-readable date/time string.
  • sql = (...) packs the arguments of the SQL function into a tuple, which str(sql) then converts into a string.
return object to array with a list of variables
array to string for packing arguments
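The unpack-and-pack style described above can be sketched as follows. The parentCID value and the "share" name are placeholders, and the current directory merely stands in for a repository folder.

```python
import os
import time

path = os.path.abspath(".")          # any existing directory works

# Unpack the stat result into named variables in one statement.
(mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime) = os.stat(path)
lastModified = time.ctime(mtime)     # e.g. 'Sat May  2 10:15:00 2015'

# Pack the arguments into a tuple; str() renders it as "(1, 'share', ...)".
parentCID = 1                        # hypothetical parent class id
args = (parentCID, "share", lastModified, 11, "localhost", path)
sql = "SELECT InsertClass" + str(args)
print(sql)
```

Note that str() applies Python quoting rules, not SQL ones, so a path containing a quote character would break the statement. Passing the tuple as the second argument of the psycopg2 cursor's execute(), with %s placeholders in the SQL string, avoids that problem.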
Finally, once the SQL function InsertClassFS(...) is completed, the program can be tested by adding a repository to the Class hierarchy.
InsertClass: add a class as a subclass of _PCID
InsertClassFS(): reuse InsertClass() to add a repository as a subclass of "Root/Repository"
Add code to execute the SQL and commit the transaction.
Execute SQL and Commit Transaction
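A sketch of this step, assuming conn is an open psycopg2 connection such as the global conn described earlier. The function name executeAndCommit is hypothetical, and the rollback on failure is an addition that the text does not mention.

```python
def executeAndCommit(conn, sql, params=None):
    # Run one statement and commit it; roll back if anything fails.
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Without the commit, psycopg2 leaves the transaction open and the new record is not visible to other sessions.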
I ran the SQL statements (lines 1 and 2) to update the Class table, as shown below.
Class table before execution
Run the program to add a repository
Class table after execution

... To be continued!

