
Module CPSRSS.feedparser

Universal feed parser

Handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom feeds

Visit http://feedparser.org/ for the latest version
Visit http://feedparser.org/docs/ for the latest documentation

Required: Python 2.1 or later
Recommended: Python 2.3 or later
Recommended: libxml2 <http://xmlsoft.org/python.html>

Classes
FeedParserDict  
_BaseHTMLProcessor  
_FeedParserMixin  
_FeedURLHandler  
_HTMLSanitizer  
_LooseFeedParser  
_RelativeURIResolver  
_StrictFeedParser  

Exceptions
CharacterEncodingOverride  
CharacterEncodingUnknown  

Function Summary
  parse(url_file_stream_or_string, etag, modified, agent, referrer, handlers)
Parse a feed from a URL, file, stream, or string
  _ebcdic_to_ascii(str)
  _getCharacterEncoding(http_headers, xml_data)
Get the character encoding of the XML document
  _open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers)
URL, filename, or string --> stream
  _parse_date(date)
Parses a variety of date formats into a tuple of 9 integers
  _resolveRelativeURIs(htmlSource, baseURI, encoding)
  _sanitizeHTML(htmlSource, encoding)
  _stripDoctype(data)
Strips DOCTYPE from XML document, returns (rss_version, stripped_data)
  _toUTF8(data, encoding)
Changes an XML data stream on the fly to specify a new encoding
  _w3dtf_parse(s)

Variable Summary
str __author__ = 'Mark Pilgrim <http://diveintomark.org/>'
list __contributors__ = ['Jason Diamond <http://injektilo.org...
str __copyright__ = 'Copyright 2002-4, Mark Pilgrim'
str __license__ = 'Python'
str __version__ = '3.2'
str ACCEPT_HEADER = 'application/atom+xml,application/rdf+xm...
list PREFERRED_XML_PARSERS = ['drv_libxml2']
tuple python_version = (2, 4)
dict SUPPORTED_VERSIONS = {'': 'unknown', 'cdf': 'CDF', 'rss0...
int TIDY_MARKUP = 0                                                                     
str USER_AGENT = 'UniversalFeedParser/3.2 +http://feedparser...
dict _additional_timezones = {'ET': -500, 'MT': -700, 'AT': -...
int _debug = 0                                                                     
list _iso8601_matches = [<built-in method match of _sre.SRE_P...
list _iso8601_re = ['(?P<year>\\d{4})-?(?P<month>[01]\\d)-?(?...
list _iso8601_tmpl = ['YYYY-?MM-?DD', 'YYYY-MM', 'YYYY-?OOO',...
SRE_Pattern _korean_date_1_re = (\d{4})\s+(\d{2})\s+(\d{2})...
unicode _korean_day = u'\xec\x9d\xbc'
unicode _korean_month = u'\xec\x9b\x94'
unicode _korean_year = u'\xeb\x85\x84'
NoneType _mxtidy = None                                                                  

Function Details

parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=[])

Parse a feed from a URL, file, stream, or string

_getCharacterEncoding(http_headers, xml_data)

Get the character encoding of the XML document

http_headers is a dictionary
xml_data is a raw string (not Unicode)

This is so much trickier than it sounds, it's not even funny. According to RFC 3023 ("XML Media Types"), if the HTTP Content-Type is application/xml, application/*+xml, application/xml-external-parsed-entity, or application/xml-dtd, the encoding given in the charset parameter of the HTTP Content-Type takes precedence over the encoding given in the XML prefix within the document, and the encoding defaults to "utf-8" if neither is specified. But if the HTTP Content-Type is text/xml, text/*+xml, or text/xml-external-parsed-entity, the encoding given in the XML prefix within the document is ALWAYS IGNORED and only the encoding given in the charset parameter of the HTTP Content-Type header should be respected; it defaults to "us-ascii" if not specified.

Furthermore, discussion on the atom-syntax mailing list with the author of RFC 3023 leads me to the conclusion that any document served with a Content-Type of text/* and no charset parameter must be treated as us-ascii (we now do this) and must always be flagged as non-well-formed (we do not do this).

If Content-Type is unspecified (input was local file or non-HTTP source) or unrecognized (server just got it totally wrong), then go by the encoding given in the XML prefix of the document and default to "utf-8" as per the XML specification. This part is probably wrong, as HTTP defaults to "iso-8859-1" if no Content-Type is specified.

Also, the default Content-Type and well-formedness of XML documents served as wacky types like "application/octet-stream" is still under discussion.
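The precedence rules above can be sketched as a small decision function. This is a minimal Python 3 illustration, not the module's implementation: the names choose_encoding and declared_xml_encoding are hypothetical, header keys are assumed to be lower-case, and the well-formedness flag mentioned above is left out.

```python
import re

# Media types for which the HTTP charset overrides the XML declaration.
APPLICATION_XML_TYPES = ('application/xml', 'application/xml-dtd',
                         'application/xml-external-parsed-entity')
# Media types for which the XML declaration is ignored entirely.
TEXT_XML_TYPES = ('text/xml', 'text/xml-external-parsed-entity')


def declared_xml_encoding(xml_data):
    """Return the encoding named in the XML declaration, or None."""
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', xml_data)
    return match.group(1).decode('ascii').lower() if match else None


def choose_encoding(http_headers, xml_data):
    """Pick an encoding from HTTP headers (dict) and raw XML bytes."""
    content_type = http_headers.get('content-type', '')
    media_type, _, params = content_type.partition(';')
    media_type = media_type.strip().lower()
    charset_match = re.search(r'charset=["\']?([^"\';\s]+)', params, re.I)
    http_charset = charset_match.group(1).lower() if charset_match else None
    xml_encoding = declared_xml_encoding(xml_data)

    is_app_xml = media_type in APPLICATION_XML_TYPES or (
        media_type.startswith('application/') and media_type.endswith('+xml'))
    is_text_xml = media_type in TEXT_XML_TYPES or (
        media_type.startswith('text/') and media_type.endswith('+xml'))

    if is_app_xml:
        # HTTP charset wins, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or 'utf-8'
    if is_text_xml:
        # The XML declaration is always ignored for these types.
        return http_charset or 'us-ascii'
    if media_type.startswith('text/') and http_charset is None:
        # Any other text/* type without a charset is treated as us-ascii.
        return 'us-ascii'
    # Missing or unrecognized Content-Type: trust the document itself.
    return xml_encoding or 'utf-8'
```

For example, a feed served as text/xml with no charset comes out as us-ascii even when its XML declaration says utf-8, which is the surprising case the text above warns about.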

_open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers)

URL, filename, or string --> stream

This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. The returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it.
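The "any input source" idea can be sketched as a chain of fallbacks. This is a minimal Python 3 illustration with a hypothetical name (open_anything); the order of the checks and the list of URL schemes are assumptions, and the request-header arguments are ignored here.

```python
import io
import urllib.request


def open_anything(source):
    """Return a file-like object for a stream, URL, pathname, or literal data."""
    if hasattr(source, 'read'):
        # Already a stream: hand it back unchanged.
        return source
    if isinstance(source, bytes):
        # Raw data: wrap it so it can be read like a file.
        return io.BytesIO(source)
    if source.startswith(('http://', 'https://', 'ftp://')):
        return urllib.request.urlopen(source)
    try:
        # Try the string as a local or network pathname.
        return open(source, 'rb')
    except (OSError, ValueError):
        pass
    # Fall back to treating the string itself as the data.
    return io.BytesIO(source.encode('utf-8'))
```

Whatever the input was, the caller can then use read(), readline(), or readlines() and call close() when finished.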

If the etag argument is supplied, it will be used as the value of an If-None-Match request header.

If the modified argument is supplied, it must be a tuple of 9 integers as returned by gmtime() in the standard Python time module. This MUST be in GMT (Greenwich Mean Time). The formatted date/time will be used as the value of an If-Modified-Since request header.

If the agent argument is supplied, it will be used as the value of a User-Agent request header.

If the referrer argument is supplied, it will be used as the value of a Referer [sic] request header.

If handlers is supplied, it is a list of handlers used to build a urllib2 opener.
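The mapping from arguments to request headers described above can be sketched as follows. This is a minimal Python 3 illustration in which urllib.request stands in for urllib2; the function name build_request is hypothetical and nothing is sent over the network.

```python
import calendar
import email.utils
import urllib.request


def build_request(url, etag=None, modified=None, agent=None, referrer=None):
    """Build a conditional-GET request from the optional arguments."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if modified:
        # modified is a 9-tuple in GMT, as returned by time.gmtime().
        stamp = calendar.timegm(modified)
        request.add_header('If-Modified-Since',
                           email.utils.formatdate(stamp, usegmt=True))
    if agent:
        request.add_header('User-Agent', agent)
    if referrer:
        request.add_header('Referer', referrer)
    return request


def build_opener(handlers=()):
    """Build an opener from a list of handler objects or classes."""
    return urllib.request.build_opener(*handlers)
```

A server that honors these headers can answer 304 Not Modified instead of resending an unchanged feed, which is the reason for passing etag and modified on repeat fetches.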

_parse_date(date)

Parses a variety of date formats into a tuple of 9 integers
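The "tuple of 9 integers" is the same shape that time.gmtime() returns. The sketch below is a minimal Python 3 illustration, not the module's parser: it handles only one W3DTF form and RFC 822 style dates, and the name parse_date is hypothetical. The result is normalized to GMT.

```python
import calendar
import email.utils
import re
import time

# One W3DTF form only: full date and time with "Z" or a numeric offset.
W3DTF = re.compile(
    r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})$')


def parse_date(text):
    """Return a 9-tuple in GMT, or None if the format is not recognized."""
    match = W3DTF.match(text.strip())
    if match:
        fields = tuple(int(group) for group in match.groups()[:6])
        stamp = calendar.timegm(fields + (0, 0, 0))
        zone = match.group(7)
        if zone != 'Z':
            sign = 1 if zone[0] == '+' else -1
            offset = sign * (int(zone[1:3]) * 3600 + int(zone[4:6]) * 60)
            stamp -= offset  # local time minus offset gives GMT
        return tuple(time.gmtime(stamp))
    parsed = email.utils.parsedate_tz(text)  # RFC 822 style dates
    if parsed:
        return tuple(time.gmtime(email.utils.mktime_tz(parsed)))
    return None
```

The returned tuple has the fields (year, month, day, hour, minute, second, weekday, day of year, DST flag).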

_stripDoctype(data)

Strips DOCTYPE from XML document, returns (rss_version, stripped_data)

rss_version may be "rss091n" or None
stripped_data is the same XML document, minus the DOCTYPE
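A minimal Python 3 sketch of the return contract described above. The name strip_doctype is hypothetical, and the rule that a DOCTYPE mentioning Netscape means "rss091n" is an assumption made for illustration. DOCTYPE declarations with an internal subset containing ">" are not handled.

```python
import re

DOCTYPE_RE = re.compile(r'<!DOCTYPE([^>]*?)>', re.MULTILINE)


def strip_doctype(data):
    """Return (rss_version, data without its DOCTYPE declaration)."""
    found = DOCTYPE_RE.search(data)
    doctype = found.group(1) if found else ''
    # Assumption: a DOCTYPE that mentions Netscape marks Netscape RSS 0.91.
    version = 'rss091n' if 'netscape' in doctype.lower() else None
    return version, DOCTYPE_RE.sub('', data)
```

A document without a DOCTYPE comes back unchanged with a version of None.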

_toUTF8(data, encoding)

Changes an XML data stream on the fly to specify a new encoding

data is a raw sequence of bytes (not Unicode) that is presumed to already be in the encoding named by the encoding argument
encoding is a string recognized by encodings.aliases
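A minimal Python 3 sketch of the idea: decode the bytes with the given encoding, rewrite or add the XML declaration so that it names the new encoding, then encode as UTF-8. The name to_utf8 is hypothetical, and byte order marks are not handled here.

```python
import re

XML_DECL_RE = re.compile(r'^<\?xml[^>]*?>')


def to_utf8(data, encoding):
    """Re-encode XML bytes as UTF-8 and make the declaration say so."""
    text = data.decode(encoding)
    new_decl = "<?xml version='1.0' encoding='utf-8'?>"
    if XML_DECL_RE.match(text):
        # Replace the existing declaration with one naming utf-8.
        text = XML_DECL_RE.sub(new_decl, text, count=1)
    else:
        # No declaration present: add one.
        text = new_decl + '\n' + text
    return text.encode('utf-8')
```

Rewriting the declaration matters because an XML parser that reads the converted bytes would otherwise still trust the old encoding name.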

Variable Details

__author__

Type:
str
Value:
'Mark Pilgrim <http://diveintomark.org/>'                              

__contributors__

Type:
list
Value:
['Jason Diamond <http://injektilo.org/>',
 'John Beimler <http://john.beimler.org/>',
 'Fazal Majid <http://www.majid.info/mylos/weblog/>',
 'Aaron Swartz <http://aaronsw.com>']                                  

__copyright__

Type:
str
Value:
'Copyright 2002-4, Mark Pilgrim'                                       

__license__

Type:
str
Value:
'Python'                                                               

__version__

Type:
str
Value:
'3.2'                                                                  

ACCEPT_HEADER

Type:
str
Value:
'application/atom+xml,application/rdf+xml,application/rss+xml,applicat\
ion/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1'           

PREFERRED_XML_PARSERS

Type:
list
Value:
['drv_libxml2']                                                        

python_version

Type:
tuple
Value:
(2, 4)                                                                 

SUPPORTED_VERSIONS

Type:
dict
Value:
{'': 'unknown',
 'atom': 'Atom (unknown version)',
 'atom01': 'Atom 0.1',
 'atom02': 'Atom 0.2',
 'atom03': 'Atom 0.3',
 'cdf': 'CDF',
 'hotrss': 'Hot RSS',
 'rss': 'RSS (unknown version)',
...                                                                    

TIDY_MARKUP

Type:
int
Value:
0                                                                     

USER_AGENT

Type:
str
Value:
'UniversalFeedParser/3.2 +http://feedparser.org/'                      

_additional_timezones

Type:
dict
Value:
{'ET': -500, 'MT': -700, 'AT': -400, 'PT': -800, 'CT': -600}           
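The values appear to follow the signed HHMM convention used for RFC 822 zone names, so -500 would mean five hours behind GMT. A minimal Python 3 sketch under that assumption; the helper names are hypothetical.

```python
# Copy of the value shown above.
ADDITIONAL_TIMEZONES = {'ET': -500, 'MT': -700, 'AT': -400, 'PT': -800, 'CT': -600}


def offset_seconds(hhmm):
    """Convert a signed HHMM integer such as -500 to seconds."""
    sign = -1 if hhmm < 0 else 1
    hours, minutes = divmod(abs(hhmm), 100)
    return sign * (hours * 3600 + minutes * 60)


def numeric_zone(name):
    """Return the numeric zone string for a name, e.g. 'PT' gives '-0800'."""
    hhmm = ADDITIONAL_TIMEZONES[name]
    sign = '-' if hhmm < 0 else '+'
    return '%s%04d' % (sign, abs(hhmm))
```

Replacing a zone name with its numeric form lets a date such as "Sat, 28 Feb 2004 18:14:55 ET" be read by any parser that understands "-0500".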

_debug

Type:
int
Value:
0                                                                     

_iso8601_matches

Type:
list
Value:
[<built-in method match of _sre.SRE_Pattern object at 0x8e3e640>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e3ec48>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e3f258>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e3f568>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e3fbe0>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e3fef0>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e404b8>,
 <built-in method match of _sre.SRE_Pattern object at 0x8e407c8>,
...                                                                    

_iso8601_re

Type:
list
Value:
['(?P<year>\\d{4})-?(?P<month>[01]\\d)-?(?P<day>[0123]\\d)(T?(?P<hour>\
\\d{2}):(?P<minute>\\d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\
\\d{2})(:(?P<tzmin>\\d{2}))?|Z)?)?',
 '(?P<year>\\d{4})-(?P<month>[01]\\d)(T?(?P<hour>\\d{2}):(?P<minute>\\\
d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\\d{2})(:(?P<tzmin>\\\
d{2}))?|Z)?)?',
 '(?P<year>\\d{4})-?(?P<ordinal>[0123]\\d\\d)(T?(?P<hour>\\d{2}):(?P<m\
inute>\\d{2})(:(?P<second>\\d{2}))?(?P<tz>[+-](?P<tzhour>\\d{2})(:(?P<\
...                                                                    

_iso8601_tmpl

Type:
list
Value:
['YYYY-?MM-?DD',
 'YYYY-MM',
 'YYYY-?OOO',
 'YY-?MM-?DD',
 'YY-?OOO',
 'YYYY',
 '-YY-?MM',
 '-OOO',
...                                                                    
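The first three templates line up with the first three _iso8601_re entries shown above, which suggests that each pattern is generated from its template. A minimal Python 3 sketch under that assumption: the placeholder mapping is read off the displayed values, the names are hypothetical, and placeholders that only occur in the truncated part of the lists are left out.

```python
import re

# Placeholder-to-group mapping, read off the displayed _iso8601_re entries.
PLACEHOLDERS = [
    ('YYYY', r'(?P<year>\d{4})'),
    ('MM', r'(?P<month>[01]\d)'),
    ('DD', r'(?P<day>[0123]\d)'),
    ('OOO', r'(?P<ordinal>[0123]\d\d)'),
]

# Optional time-of-day and timezone part shared by every pattern.
TIME_SUFFIX = (r'(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?'
               r'(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?')


def expand_template(template):
    """Turn a template such as 'YYYY-?MM-?DD' into a regular expression."""
    pattern = template
    for placeholder, group in PLACEHOLDERS:
        pattern = pattern.replace(placeholder, group)
    return pattern + TIME_SUFFIX


# Bound match methods, in the same spirit as the _iso8601_matches list.
matchers = [re.compile(expand_template(t)).match
            for t in ('YYYY-?MM-?DD', 'YYYY-MM', 'YYYY-?OOO')]
```

Trying each matcher in order and taking the first hit gives a simple way to support many date layouts with one loop.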

_korean_date_1_re

Type:
SRE_Pattern
Value:
(\d{4})\s+(\d{2})\s+(\d{2})\s+(\d{2}):(\d{2}):(\d{2})         

_korean_day

Type:
unicode
Value:
u'\xec\x9d\xbc'                                                        

_korean_month

Type:
unicode
Value:
u'\xec\x9b\x94'                                                        

_korean_year

Type:
unicode
Value:
u'\xeb\x85\x84'                                                        

_mxtidy

Type:
NoneType
Value:
None                                                                  

Generated by Epydoc 2.1 on Mon Jun 27 12:47:50 2005 http://epydoc.sf.net