$Id$

This directory contains a set of simple tools to download and process
SourceForge tracker items.

Tracker data is represented as a set of files in a tracker directory.
For each tracker item, there are at least two files:

    item-NNN.xml (index information, created by getindex.py)
    item-NNN-page.xml (xhtml pages, created by getpages.py)

where NNN is the tracker item identifier.  For items that have
attached files, there are also one or more

    item-NNN-data-MMM.dat (data files, created by getfiles.py)

files, where MMM is a file identifier (referred to by the page files).
Each data file consists of a copy of the HTTP header (which includes
the content-type and content-disposition headers), followed by an
empty line, and the actual data.

--------------------------------------------------------------------
Downloading and Updating Tracker Datasets
--------------------------------------------------------------------

To download tracker datasets, run 'init' to set things up, and use the
getindex/getpages/getfiles scripts to download items.

* init

The 'init' script is used to select which tracker to download.  It
asks for a tracker "group id".  To get the group id for your project,
check the URL for the tracker homepage.  If you press return, the
group id defaults to 5470, which is the group id for the Python
tracker.

The 'init' script downloads the tracker homepage, and creates tracker
directories for the individual trackers used by the given project.

    $ python init.py
    enter sourceforge tracker group id [5470]: 1234
    --- create tracker-123456

You only have to run the 'init' script once for each project.

* getindex

The 'getindex' script parses the tracker index, and creates item files
which contain overview information from the index pages.  Usage:

    $ python getindex.py tracker-123456 [offset]

If the offset is omitted, the parser starts at offset 0, and keeps
going until it gets an index page for which all items have already
been downloaded.
If an offset is given, the parser keeps going until it cannot find
any more items.

You can use the output from 'getindex' to generate tracker statistics.
To get more information about the items, use the 'getpages' and
'getfiles' scripts.

* getpages

The 'getpages' script looks for item files, and downloads missing page
files.

    $ python getpages.py tracker-123456

To refresh the page files, remove them from the tracker directory, and
run the 'getpages' script again.

    $ rm tracker-123456/*-page.xml
    $ python getpages.py tracker-123456

* getfiles

The 'getfiles' script, finally, looks for download links in the page
files, and downloads missing data files.

    $ python getfiles.py tracker-123456

* status

The 'status' script can be used to get a download status summary:

    $ python status.py tracker-123456
    6682 items
    6682 pages (100%)
    1912 files

--------------------------------------------------------------------
Processing Tracker Datasets
--------------------------------------------------------------------

To process tracker datasets, use the 'extract' module to extract
relevant information from item-NNN-page.xml files.  See the export
scripts for examples:

    csv-export.py is a simple dataset to CSV exporter.

    xml-export.py is a simple dataset to XML exporter.  The resulting
    XML file contains all data from the tracker dataset, including
    attached files (stored as BASE64-encoded blocks).

More export scripts, bug fixes, and other contributions are welcome.
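
As a possible starting point for a new script, the file layout
described at the top of this README can be read with a few lines of
plain Python.  The following is a minimal sketch and not part of the
toolkit: the names classify() and split_data_file() are hypothetical,
it does not use the 'extract' module, and it assumes that NNN and MMM
are numeric and that header lines have the usual "Name: value" form.

```python
import re

# File name patterns, following the layout described in this README.
# Assumption: the item identifier (NNN) and file identifier (MMM)
# consist of digits only.
DATA_RE = re.compile(r"^item-(\d+)-data-(\d+)\.dat$")
PAGE_RE = re.compile(r"^item-(\d+)-page\.xml$")
ITEM_RE = re.compile(r"^item-(\d+)\.xml$")


def classify(filename):
    """Return (kind, item id, file id or None), or None if no match."""
    patterns = (("data", DATA_RE), ("page", PAGE_RE), ("item", ITEM_RE))
    for kind, pattern in patterns:
        match = pattern.match(filename)
        if match:
            groups = match.groups()
            return kind, groups[0], groups[1] if len(groups) > 1 else None
    return None


def split_data_file(raw):
    """Split the bytes of a data file into (headers, body).

    The header copy ends at the first empty line.  Both CRLF and LF
    line endings are accepted, since the README does not say which
    one the data files use.
    """
    candidates = []
    for separator in (b"\r\n\r\n", b"\n\n"):
        position = raw.find(separator)
        if position >= 0:
            candidates.append((position, len(separator)))
    if not candidates:
        raise ValueError("no empty line found after the header")
    position, length = min(candidates)  # earliest empty line wins
    head, body = raw[:position], raw[position + length:]

    headers = {}
    for line in head.decode("latin-1").splitlines():
        name, colon, value = line.partition(":")
        if colon:  # lines without a colon (e.g. a status line) are skipped
            headers[name.strip().lower()] = value.strip()
    return headers, body
```

To use it, read a data file in binary mode and pass the bytes to
split_data_file(); the returned dictionary uses lower-case header
names, so the content type is available as headers["content-type"].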