Module p.u.feedparser

Part of pida.utils

Universal feed parser

Handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds

Visit http://feedparser.org/ for the latest version
Visit http://feedparser.org/docs/ for the latest documentation

Required: Python 2.1 or later
Recommended: Python 2.3 or later
Recommended: CJKCodecs and iconv_codec <http://cjkpython.i18n.org/>

Line # Kind Name Docs
98 Function _xmlescape Undocumented
132 Class ThingsNobodyCaresAboutButMe Undocumented
133 Class CharacterEncodingOverride Undocumented
134 Class CharacterEncodingUnknown Undocumented
135 Class NonXMLContentType Undocumented
136 Class UndeclaredNamespace Undocumented
166 Function dict Undocumented
172 Class FeedParserDict Undocumented
247 Function zopeCompatibilityHack Undocumented
257 Function _ebcdic_to_ascii Undocumented
284 Function _urljoin Undocumented
288 Class _FeedParserMixin Undocumented
1341 Class _StrictFeedParser Undocumented
1413 Class _BaseHTMLProcessor No class docstring; 1/15 methods documented
1528 Class _LooseFeedParser Undocumented
1552 Class _RelativeURIResolver Undocumented
1591 Function _resolveRelativeURIs Undocumented
1597 Class _HTMLSanitizer Undocumented
1650 Function _sanitizeHTML Undocumented
1689 Class _FeedURLHandler Undocumented
1743 Function _open_resource URL, filename, or string --> stream
1834 Function registerDateHandler Register a date handler function (takes string, returns 9-tuple date in GMT)
1868 Function _parse_date_iso8601 Parse a variety of ISO-8601-compatible formats like 20040105
1961 Function _parse_date_onblog Parse a string according to the OnBlog 8-bit date format
1973 Function _parse_date_nate Parse a string according to the Nate 8-bit date format
1994 Function _parse_date_mssql Parse a string according to the MS SQL date format
2044 Function _parse_date_greek Parse a string according to a Greek 8-bit date format.
2081 Function _parse_date_hungarian Parse a string according to a Hungarian 8-bit date format.
2107 Function _parse_date_w3dtf Undocumented
2202 Function _parse_date_rfc822 Parse an RFC822, RFC1123, RFC2822, or asctime-style date
2226 Function _parse_date Parses a variety of date formats into a 9-tuple in GMT
2242 Function _getCharacterEncoding Get the character encoding of the XML document
2378 Function _toUTF8 Changes an XML data stream on the fly to specify a new encoding
2431 Function _stripDoctype Strips DOCTYPE from XML document, returns (rss_version, stripped_data)
2449 Function parse Parse a feed from a URL, file, stream, or string
def _xmlescape(data):
Undocumented
def dict(aList):
Undocumented
def zopeCompatibilityHack():
Undocumented
def _ebcdic_to_ascii(s):
Undocumented
def _urljoin(base, uri):
Undocumented
def _resolveRelativeURIs(htmlSource, baseURI, encoding):
Undocumented
def _sanitizeHTML(htmlSource, encoding):
Undocumented
def _open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers):
URL, filename, or string --> stream

This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it.

If the etag argument is supplied, it will be used as the value of an If-None-Match request header.

If the modified argument is supplied, it must be a tuple of 9 integers as returned by gmtime() in the standard Python time module. This MUST be in GMT (Greenwich Mean Time). The formatted date/time will be used as the value of an If-Modified-Since request header.

If the agent argument is supplied, it will be used as the value of a User-Agent request header.

If the referrer argument is supplied, it will be used as the value of a Referer[sic] request header.

If handlers is supplied, it is a list of handlers used to build a urllib2 opener.
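As a rough illustration of how the optional arguments above map onto request headers, here is a minimal sketch written against Python 3's urllib.request (the successor to urllib2). The helper name build_request is hypothetical and is not part of this module; the sketch only shows the header mapping, not how the module actually opens the resource.

```python
import calendar
from email.utils import formatdate
from urllib.request import Request


def build_request(url, etag=None, modified=None, agent=None, referrer=None):
    """Hypothetical helper: attach the optional headers described above."""
    request = Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if modified:
        # `modified` is a 9-tuple in GMT, so calendar.timegm (not time.mktime,
        # which assumes local time) converts it to a POSIX timestamp.
        timestamp = calendar.timegm(modified)
        request.add_header("If-Modified-Since", formatdate(timestamp, usegmt=True))
    if agent:
        request.add_header("User-Agent", agent)
    if referrer:
        request.add_header("Referer", referrer)
    return request
```

Under the same assumptions, a list of handlers could then be applied with urllib.request.build_opener(*handlers).open(request).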

def registerDateHandler(func):
Register a date handler function (takes string, returns 9-tuple date in GMT)
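The contract (a function that takes a string and returns a 9-tuple in GMT) can be sketched with a small registry. This is an illustration only: the names _handlers, register_handler, parse_with_handlers and parse_compact_date are hypothetical, and the choice to try the most recently registered handler first is an assumption, not something this page states.

```python
import time

# Hypothetical registry; the most recently registered handler is tried first.
_handlers = []


def register_handler(func):
    """Register func(date_string) -> 9-tuple in GMT."""
    _handlers.insert(0, func)


def parse_with_handlers(date_string):
    """Return the first valid 9-tuple produced by a handler, else None."""
    for handler in _handlers:
        try:
            result = handler(date_string)
        except (ValueError, TypeError, OverflowError):
            continue  # this handler does not understand the format
        if result is not None and len(result) == 9:
            return tuple(result)
    return None


def parse_compact_date(date_string):
    """Example handler for dates like 20040105 (YYYYMMDD), treated as GMT."""
    return time.strptime(date_string, "%Y%m%d")
```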
def _parse_date_iso8601(dateString):
Parse a variety of ISO-8601-compatible formats like 20040105
def _parse_date_onblog(dateString):
Parse a string according to the OnBlog 8-bit date format
def _parse_date_nate(dateString):
Parse a string according to the Nate 8-bit date format
def _parse_date_mssql(dateString):
Parse a string according to the MS SQL date format
def _parse_date_greek(dateString):
Parse a string according to a Greek 8-bit date format.
def _parse_date_hungarian(dateString):
Parse a string according to a Hungarian 8-bit date format.
def _parse_date_w3dtf(dateString):
Undocumented
def _parse_date_rfc822(dateString):
Parse an RFC822, RFC1123, RFC2822, or asctime-style date
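For orientation, the standard library can already turn an RFC 822 style date into a 9-tuple in GMT. The sketch below uses email.utils and is not this module's implementation; the function name parse_rfc822_to_gmt is hypothetical.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def parse_rfc822_to_gmt(date_string):
    """Return a 9-tuple in GMT for an RFC 822 style date, or None."""
    parsed = parsedate_tz(date_string)  # 10-tuple; the last item is the UTC offset
    if parsed is None:
        return None
    # mktime_tz applies the offset and yields a UTC timestamp.
    return tuple(time.gmtime(mktime_tz(parsed)))
```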
def _parse_date(dateString):
Parses a variety of date formats into a 9-tuple in GMT
def _getCharacterEncoding(http_headers, xml_data):
Get the character encoding of the XML document

http_headers is a dictionary
xml_data is a raw string (not Unicode)

This is so much trickier than it sounds, it's not even funny.
According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type
is application/xml, application/*+xml,
application/xml-external-parsed-entity, or application/xml-dtd,
the encoding given in the charset parameter of the HTTP Content-Type
takes precedence over the encoding given in the XML prefix within the
document, and defaults to 'utf-8' if neither is specified.  But, if
the HTTP Content-Type is text/xml, text/*+xml, or
text/xml-external-parsed-entity, the encoding given in the XML prefix
within the document is ALWAYS IGNORED and only the encoding given in
the charset parameter of the HTTP Content-Type header should be
respected, and it defaults to 'us-ascii' if not specified.

Furthermore, discussion on the atom-syntax mailing list with the
author of RFC 3023 leads me to the conclusion that any document
served with a Content-Type of text/* and no charset parameter
must be treated as us-ascii.  (We now do this.)  And also that it
must always be flagged as non-well-formed.  (We now do this too.)

If Content-Type is unspecified (input was local file or non-HTTP source)
or unrecognized (server just got it totally wrong), then go by the
encoding given in the XML prefix of the document and default to
'iso-8859-1' as per the HTTP specification (RFC 2616).
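
The precedence rules described so far can be condensed into a short sketch. The function name choose_encoding and its three parameters are hypothetical, and the sketch leaves out the well-formedness flag. How a text/* type other than the XML ones behaves when a charset is given is an assumption.

```python
def choose_encoding(content_type, http_charset, xml_encoding):
    """Pick an encoding following the precedence rules described above.

    content_type: media type without parameters, or None if unknown
    http_charset: charset parameter of the Content-Type header, or None
    xml_encoding: encoding named in the XML declaration, or None
    """
    application_types = (
        "application/xml",
        "application/xml-dtd",
        "application/xml-external-parsed-entity",
    )
    text_types = ("text/xml", "text/xml-external-parsed-entity")
    ctype = (content_type or "").lower()

    if ctype in application_types or (
        ctype.startswith("application/") and ctype.endswith("+xml")
    ):
        # HTTP charset wins, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or "utf-8"
    if ctype in text_types or (ctype.startswith("text/") and ctype.endswith("+xml")):
        # The XML declaration is ignored entirely.
        return http_charset or "us-ascii"
    if ctype.startswith("text/"):
        # Assumption: any other text/* type follows the same rule.
        return http_charset or "us-ascii"
    # Missing or unrecognized Content-Type: trust the document, then iso-8859-1.
    return xml_encoding or "iso-8859-1"
```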

Then, assuming we didn't find a character encoding in the HTTP headers
(and the HTTP Content-type allowed us to look in the body), we need
to sniff the first few bytes of the XML data and try to determine
whether the encoding is ASCII-compatible.  Section F of the XML
specification shows the way here:
http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

If the sniffed encoding is not ASCII-compatible, we need to make it
ASCII compatible so that we can sniff further into the XML declaration
to find the encoding attribute, which will tell us the true encoding.
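
A minimal sketch of the two sniffing steps, based on the byte signatures listed in Appendix F of the XML specification. The function names are hypothetical, and using cp037 for the EBCDIC case is an assumption.

```python
import re

# Four-byte signatures are listed before the shorter BOMs they overlap with.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),  # UCS-4 big-endian BOM
    (b"\xff\xfe\x00\x00", "utf-32-le"),  # UCS-4 little-endian BOM
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # '<' in UCS-4, no BOM
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # '<?' in UTF-16, no BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),      # '<?xm' in EBCDIC (codec is an assumption)
    (b"\xef\xbb\xbf", "utf-8"),          # UTF-8 BOM
    (b"\xfe\xff", "utf-16-be"),          # UTF-16 BOMs
    (b"\xff\xfe", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),  # '<?xm' in ASCII
]

_ENCODING_ATTR = re.compile(
    rb"""\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_encoding_family(xml_data):
    """Guess the encoding family from the leading bytes, or None if unknown."""
    for signature, family in _SIGNATURES:
        if xml_data.startswith(signature):
            return family
    return None


def declared_encoding(xml_data):
    """Return the encoding named in the XML declaration, or None."""
    family = sniff_encoding_family(xml_data)
    if family not in (None, "ascii-compatible"):
        # Transcode so the declaration can be read as ASCII-compatible bytes.
        text = xml_data.decode(family, "replace").lstrip("\ufeff")
        xml_data = text.encode("utf-8")
    match = _ENCODING_ATTR.match(xml_data)
    return match.group(1).decode("ascii") if match else None
```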

Of course, none of this guarantees that we will be able to parse the
feed in the declared character encoding (assuming it was declared
correctly, which many are not).  CJKCodecs and iconv_codec help a lot;
you should definitely install them if you can.
http://cjkpython.i18n.org/
def _toUTF8(data, encoding):
Changes an XML data stream on the fly to specify a new encoding

data is a raw sequence of bytes (not Unicode) that is presumed to be in %encoding already
encoding is a string recognized by encodings.aliases
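
One way to picture this transformation is to decode the bytes, rewrite the XML declaration, and re-encode. The sketch below is an illustration under that assumption, not the module's code; the name to_utf8 and the exact replacement declaration are hypothetical.

```python
import re

_DECLARATION = re.compile(r"^<\?xml[^>]*?\?>")


def to_utf8(data, encoding):
    """Decode `data` from `encoding`; return UTF-8 bytes declaring utf-8."""
    # Raises UnicodeDecodeError or LookupError if the encoding is wrong or unknown.
    text = data.decode(encoding)
    text = text.lstrip("\ufeff")  # drop a leading byte order mark, if any
    new_declaration = "<?xml version='1.0' encoding='utf-8'?>"
    if _DECLARATION.match(text):
        text = _DECLARATION.sub(new_declaration, text, count=1)
    else:
        text = new_declaration + "\n" + text
    return text.encode("utf-8")
```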

def _stripDoctype(data):
Strips DOCTYPE from XML document, returns (rss_version, stripped_data)

rss_version may be 'rss091n' or None
stripped_data is the same XML document, minus the DOCTYPE
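
A regular-expression sketch of the return contract follows. It assumes that 'rss091n' is reported when the DOCTYPE mentions Netscape (the RSS 0.91 DTD), which this page does not state. It also does not handle a DOCTYPE with an internal subset. The name strip_doctype is hypothetical.

```python
import re

_DOCTYPE = re.compile(r"<!DOCTYPE([^>]*?)>", re.MULTILINE)


def strip_doctype(data):
    """Return (rss_version, data_without_doctype) for an XML document string."""
    match = _DOCTYPE.search(data)
    doctype = match.group(1) if match else ""
    # Assumption: only a Netscape DTD maps to 'rss091n'; anything else is None.
    rss_version = "rss091n" if "netscape" in doctype.lower() else None
    return rss_version, _DOCTYPE.sub("", data)
```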

def parse(url_file_stream_or_string, etag=None, modified=None, agent=None, referrer=None, handlers=):
Parse a feed from a URL, file, stream, or string
API Documentation for PIDA, generated by pydoctor.