Pinboard (rahuldave)

https://pinboard.in/u:rahuldave/public/
recent bookmarks from rahuldave

Full Changelog — Astropy v0.2.1
2013-04-04T10:40:37+00:00 http://docs.astropy.org/en/v0.2.1/changelog.html#id1 rahuldave python astropy https://twitter.com/ https://pinboard.in/u:rahuldave/b:34e280c10d22/

astropy 0.2.1 : Python Package Index
2013-04-04T10:40:37+00:00 https://pypi.python.org/pypi/astropy/0.2.1 rahuldave python astropy https://twitter.com/ https://pinboard.in/u:rahuldave/b:4ee9aafa5477/

Improving your code with modern idioms — Porting to Python 3 - The Book Site
2012-05-22T17:04:50+00:00 http://python3porting.com/improving.html rahuldave python programming https://pinboard.in/ https://pinboard.in/u:rahuldave/b:76a958c1a8a8/

Python as a Lisp dialect
2012-05-03T11:55:11+00:00 http://www.johndcook.com/blog/2012/05/03/python-as-a-lisp-dialect/ rahuldave Python https://pinboard.in/u:rahuldave/b:89bdfc335c7c/

Julia, Python and Cython - julia-dev | Google Groups
2012-04-22T14:03:37+00:00 http://groups.google.com/group/julia-dev/t/61fb4e3847dcc2b9 rahuldave python julialang https://twitter.com/ https://pinboard.in/u:rahuldave/b:8d82c7211972/

images/example3_d3.jpg at master from cschin/IPython-Notebook---d3.js-mashup - GitHub
2012-02-16T02:28:20+00:00 https://github.com/cschin/IPython-Notebook---d3.js-mashup/blob/master/images/example3_d3.jpg rahuldave d3 python https://twitter.com/ https://pinboard.in/u:rahuldave/b:4f493dbee388/

Running Python and R inside Emacs
2012-02-09T13:00:58+00:00 http://www.johndcook.com/blog/2012/02/09/python-org-mode/ rahuldave Python Emacs Literate_programming Reproducibility Rstats https://pinboard.in/u:rahuldave/b:387221004fa1/

Python Introduction - Google's Python Class - Google Code
2012-02-02T18:07:10+00:00 http://code.google.com/edu/languages/google-python-class/introduction.html rahuldave google python programming tutorial https://pinboard.in/ https://pinboard.in/u:rahuldave/b:073050eb4a0a/

How to compute jinc(x)
2012-02-02T16:01:04+00:00 http://feedproxy.google.com/~r/TheEndeavour/~3/-ocmh6wYnrg/ rahuldave

    double jinc(double x) {
        return j1(x) / x;
    }

The problem is that if you pass in 0, the code will divide by 0 and return a NaN. The function jinc(x) is defined to be 1/2 at x = 0 because that's the limit of J1(x) / x as x goes to 0. So we try again:

    #include <math.h>

    double jinc(double x) {
        return (x == 0.0) ? 0.5 : j1(x) / x;
    }

Does that work? Technically, it could still fail (we'll come back to that at the end), but we'll assume for now that it's OK.

We could write the analogous Python code, and it would be adequate as long as we're only calling the function with scalars and not NumPy arrays.

    from scipy.special import j1

    def jinc(x):
        if x == 0.0:
            return 0.5
        return j1(x) / x

Now suppose you want to plot this function. You create an array of points, say x = np.linspace(-1, 1, 25) (with NumPy imported as np), and plot jinc(x). You'll get an error: "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()". Incidentally, if we called linspace with an even integer in the last argument, our array of points would avoid zero and the naive implementation of jinc, the one with no test for zero, would work.

When Python tries to apply jinc to an array, it doesn't know how to interpret the test x == 0. The error message is asking, in effect, "Do you mean if any component of x is 0? Or if all components of x are 0?" Neither option is what we want. We want to apply jinc as written to each element of x. We could do this by calling the vectorize function.

    jinc = np.vectorize(jinc)

This replaces our original jinc function with one that handles NumPy arrays correctly.

There is an extremely unlikely scenario in which the code above could fail. The value of J1(x) is approximately x/2 for small values of x. If the floating point value x is so small that 0.5*x returns 0, our function will return 0, even though it should return 0.5. The C code above works for values of x as small as DBL_MIN and even values much smaller.
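The failure mode just described can be reproduced with a minimal Python sketch. It makes a loud assumption: instead of calling a real Bessel routine, it stands in for J1(x) with the leading term x/2 that the text mentions, so it only illustrates the floating point underflow, not the accuracy of any library. The names j1_leading_term and naive_jinc are hypothetical and exist only for this example.

```python
import sys


def j1_leading_term(x):
    # Assumption: stand-in for J1(x) using only its leading term x/2,
    # which is how J1 behaves for very small x. Not a real Bessel routine.
    return 0.5 * x


def naive_jinc(x):
    # Same structure as the scalar code above: only exact zero is special-cased.
    if x == 0.0:
        return 0.5
    return j1_leading_term(x) / x


dbl_min = sys.float_info.min   # smallest normalized double (DBL_MIN in C)
tiny = dbl_min / 2**52         # smallest positive subnormal double

print(naive_jinc(dbl_min))     # 0.5 * x is still representable, so the ratio is 0.5
print(naive_jinc(tiny))        # 0.5 * x rounds to 0.0, so the ratio collapses to 0.0
```

The second call goes wrong because the test x == 0.0 is false for the tiny input while the numerator has already underflowed to zero.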
(DBL_MIN is not the smallest value of a double, only the smallest normalized double.) But if you set

    x = DBL_MIN / pow(2.0, 52);

then jinc(x) will return 0.

If you want to be absolutely safe, you could change the implementation to

    #include <math.h>

    double jinc(double x) {
        return (fabs(x) < 1e-8) ? 0.5 : j1(x) / x;
    }

Why test for whether the absolute value is less than 10^-8 rather than a much smaller number? For small x, the error in approximating jinc(x) with 1/2 is on the order of x^2/16. So for x as large as 10^-8, the approximation error is below the resolution of a double. As a bonus, the function jinc(x) will be more efficient for |x| < 10^-8 since it avoids a call to j1.

Related posts:
Jinc function
Sine approximation for small angles
Functions in math.h that seem unnecessary

Python SciPy https://pinboard.in/u:rahuldave/b:5871a164011b/

[untitled]
2012-01-19T16:51:37+00:00 http://www.dabeaz.com/generators/Generators.pdf? rahuldave python https://pinboard.in/ https://pinboard.in/u:rahuldave/b:96ec7ca4409b/

Benford’s law and SciPy
2011-10-19T11:54:00+00:00 http://feedproxy.google.com/~r/TheEndeavour/~3/VqPL0m8y7Ks/ rahuldave Python SciPy https://pinboard.in/u:rahuldave/b:70769a6a8719/

The Technology Behind Convore
2011-02-16T12:29:35+00:00 http://www.eflorenzano.com/blog/post/technology-behind-convore/ rahuldave

    @ericflo" },
    { "type": "username", "user_id": 56, "username": "simonw", "markup": " @simonw" },
    { "type": "text", "markup": " Here's how we connect/disconnect from Redis in production: " },
    { "type": "url", "url": "http://dpaste.com/406797/", "markup": "http://dpaste.com/406797/" }
    ]

After this is constructed, we log all our available information about this message, and then save to the database both the raw message as it was received and the JSON-encoded parsed node list. Now a task is sent to Celery (by way of Redis) notifying it that this new message has been received.
This Celery task now increments the unread count for everyone who has access to the topic that the message was posted in, and then it publishes to a Redis pub/sub for the group that the message was posted to. Finally, the task scans through the message, looking for any users that were mentioned in the message, and writes entries to the database for every mention.

On the other end of that pub/sub are the many open HTTP requests that our users have initiated, which are waiting for any new messages or information. Those all simultaneously return the new message information, at which point they reconnect again, waiting for the next message to arrive.

The real-time endpoint

Our live updates endpoint is actually a very simple and lightweight pure-WSGI Python application, hosted using Eventlet. It spawns off a coroutine for each request, and in that coroutine, it looks up all the groups that a user is a member of, and then opens a connection to Redis subscribing to all of those channels. Each of these Eventlet-hosted Python applications has the ability to host hundreds-to-thousands of open connections, and we run several instances on each of our front-end machines. It has a few more responsibilities, like marking a topic as read before it returns a response, but the most important thing is to be a bridge between the user and Redis pub/sub.

Future improvements

There are so many places where our architecture can be improved. This is our first version, and now that real users are using the system, already some of our initial assumptions are being challenged. For instance, we thought that pub/sub to a channel per group would be enough, but what that means is that everyone in a group sees the exact same events as everyone else in that group. This means we don't have the ability to customize each user's experience based on their preferences: there is no way to put a user on ignore, filter certain messages, etc. It also means that we aren't able to sync up a user's experience across tabs or browsers, since we don't really want to broadcast to everyone in the group that one user has visited a topic, thereby removing any unread messages in that topic. So going forward we're going to have to break up that per-group pub/sub into per-user pub/sub.

Another area that could be improved is our unread counts. Right now they're stored as rows in our PostgreSQL database, which makes it extremely easy to batch update them and do aggregate queries on them, but the number of these rows is increasing rapidly, and without some kind of sharding scheme, it will at some point become more difficult to work with such a large number of rows. My feeling is that this will eventually need to be moved into a non-relational data store, and we'll need to write a service layer in front of it to deal with pre-aggregating and distributing updates, but nothing is set in stone just yet.

Finally, Python may not be the best language for this real-time endpoint. Eventlet is a fantastic Python library and it allowed us to build something extremely fast that has scaled to several thousand concurrent connections without breaking a sweat on launch day, but it has its limits. There is a large body of work out there on handling a large number of open connections, using Java's NIO framework, Erlang's mochiweb, or node.js.

That's all folks

We're pretty proud of what we've built in a very short time, and we're glad it has held up as well as it has on our launch day and afterwards. We're excited about the problems we're now being faced with, both scaling the technology, and scaling the product. I hope this article has quenched any curiosity out there about how Convore works. If there are any questions, feel free to join Convore and ask away!
(Or discuss it on Hacker News)

Convore Django Eventlet Haystack PostgreSQL Python Realtime Redis Solr https://pinboard.in/u:rahuldave/b:a519ac9b35a0/

feature: Tutorial: consuming Twitter's real-time stream API in Python
2010-04-21T17:45:00+00:00 http://feeds.arstechnica.com/~r/arstechnica/index/~3/tGM5tqWsxfY/tutorial-use-twitters-new-real-time-stream-api-in-python.ars rahuldave Features Guides Open-source Web programming python tutorial twitter https://pinboard.in/u:rahuldave/b:58ebd66b4b7c/

Introduction to Surlex
2010-04-11T19:23:35+00:00 http://simonwillison.net/2010/Apr/11/surlex/ rahuldave codysoyland django python regex surlex urls https://pinboard.in/u:rahuldave/b:eca0516a82c5/

The Onion Uses Django, And Why It Matters To Us
2010-03-25T18:43:24+00:00 http://simonwillison.net/2010/Mar/25/onion/ rahuldave django drupal php python reddit theonion https://pinboard.in/u:rahuldave/b:a631a49f5750/

Top Ten One-Liners from CommandLineFu Explained
2010-03-18T03:00:21+00:00 http://feedproxy.google.com/~r/catonmat/~3/GJRqxzmBW9c/ rahuldave

    > ~/.ssh/authorized_keys

This one-liner saves a great deal of typing. Actually I just found out that there was a shorter way to do it:

    your-machine$ ssh remote-machine 'cat >> .ssh/authorized_keys' < .ssh/identity.pub

#10. Capture video of a linux desktop

    $ ffmpeg -f x11grab -s wxga -r 25 -i :0.0 -sameq /tmp/out.mpg

By pure coincidence, I have done so much video processing with ffmpeg that I know what most of this command does without looking much at the manual.

ffmpeg can generally be described as a command that takes a bunch of options, the last of which is the output file. In this case the options are -f x11grab -s wxga -r 25 -i :0.0 -sameq and the output file is /tmp/out.mpg.

Here is what the options mean:

-f x11grab makes ffmpeg set the input video format to x11grab. The X11 framebuffer presents data in a specific format, and this option makes ffmpeg decode it correctly.
-s wxga makes ffmpeg set the size of the video to wxga, which is a shortcut for 1366×768. This is a strange resolution to use; I’d just write -s 800x600.
-r 25 sets the framerate of the video to 25fps.
-i :0.0 sets the video input file to X11 display 0.0 at localhost.
-sameq preserves the quality of the input stream. It’s best to preserve the quality and post-process it later.

You can also tell ffmpeg to grab the display from another X server by changing -i :0.0 to -i host:0.0.

If you’re interested in ffmpeg, here are my other articles on ffmpeg that I wrote a while ago:

How to Extract Audio Tracks from YouTube Videos
Converting YouTube Flash Videos to a Better Format with ffmpeg

PS. This article was so much fun to write that I decided to write several more parts. Tune in next time for “The Next Top Ten One-Liners from CommandLineFu Explained” :)

Have fun. See ya!

PPS. Follow me on twitter for updates.

Programming authorized_keys bash cd combinatorics commandlinefu cp desktop display event_designators ffmpeg history identity.pub id_rsa.pub linux mtr oldpwd one_liners passwordless_authentication ping public_key_authentication python pythonpath root sets shell simplehttpserver ssh ssh_copy_id ssh_keygen sshv1 sshv2 sudo tee traceroute vim x11 https://pinboard.in/u:rahuldave/b:eb42c63da138/

PyMOTW: Parsing XML Documents with ElementTree
2010-03-14T14:58:00+00:00 http://blog.doughellmann.com/2010/03/pymotw-parsing-xml-documents-with.html rahuldave

    My Podcasts
    Sun, 07 Mar 2010 15:53:26 GMT
    Sun, 07 Mar 2010 15:53:26 GMT

To parse the file, pass an open file handle to parse(). It will read the data, parse the XML, and return an ElementTree object.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    print tree

    $ python ElementTree_parse_opml.py

Traversing the Parsed Tree

Now that we have a parsed XML tree, we can iterate over it, visiting all of the children in order and examining their attributes and contents.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.getiterator():
        print node.tag, node.attrib

Here we print the entire tree, one tag at a time.

    $ python ElementTree_dump_opml.py
    opml {'version': '1.0'}
    head {}
    title {}
    dateCreated {}
    dateModified {}
    body {}
    outline {'text': 'Science and Tech'}
    outline {'xmlUrl': 'http://www.publicradio.org/columns/futuretense/podcast.xml', 'text': 'APM: Future Tense', 'type': 'rss', 'htmlUrl': 'http://www.publicradio.org/columns/futuretense/'}
    outline {'xmlUrl': 'http://www.npr.org/rss/podcast.php?id=510030', 'text': 'Engines Of Our Ingenuity Podcast', 'type': 'rss', 'htmlUrl': 'http://www.uh.edu/engines/engines.htm'}
    outline {'xmlUrl': 'http://www.nyas.org/Podcasts/Atom.axd', 'text': 'Science & the City', 'type': 'rss', 'htmlUrl': 'http://www.nyas.org/WhatWeDo/SciencetheCity.aspx'}
    outline {'text': 'Books and Fiction'}
    outline {'xmlUrl': 'http://feeds.feedburner.com/podiobooks', 'text': 'Podiobooker', 'type': 'rss', 'htmlUrl': 'http://www.podiobooks.com/blog'}
    outline {'xmlUrl': 'http://web.me.com/normsherman/Site/Podcast/rss.xml', 'text': 'The Drabblecast', 'type': 'rss', 'htmlUrl': 'http://web.me.com/normsherman/Site/Podcast/Podcast.html'}
    outline {'xmlUrl': 'http://www.tor.com/rss/category/TorDotStories', 'text': 'tor.com / category / tordotstories', 'type': 'rss', 'htmlUrl': 'http://www.tor.com/'}
    outline {'text': 'Computers and Programming'}
    outline {'xmlUrl': 'http://leo.am/podcasts/mbw', 'text': 'MacBreak Weekly', 'type': 'rss', 'htmlUrl': 'http://twit.tv/mbw'}
    outline {'xmlUrl': 'http://leo.am/podcasts/floss', 'text': 'FLOSS Weekly', 'type': 'rss', 'htmlUrl': 'http://twit.tv'}
    outline {'xmlUrl': 'http://www.coreint.org/podcast.xml', 'text': 'Core Intuition', 'type': 'rss', 'htmlUrl': 'http://www.coreint.org/'}
    outline {'text': 'Python'}
    outline {'xmlUrl': 'http://advocacy.python.org/podcasts/pycon.rss', 'text': 'PyCon Podcast', 'type': 'rss', 'htmlUrl': 'http://advocacy.python.org/podcasts/'}
    outline {'xmlUrl': 'http://advocacy.python.org/podcasts/littlebit.rss', 'text': 'A Little Bit of Python', 'type': 'rss', 'htmlUrl': 'http://advocacy.python.org/podcasts/'}
    outline {'xmlUrl': 'http://djangodose.com/everything/feed/', 'text': 'Django Dose Everything Feed', 'type': 'rss'}
    outline {'text': 'Miscelaneous'}
    outline {'xmlUrl': 'http://www.castsampler.com/cast/feed/rss/dhellmann/', 'text': "dhellmann's CastSampler Feed", 'type': 'rss', 'htmlUrl': 'http://www.castsampler.com/users/dhellmann/'}

If we wanted to print only the groups of names and feed URLs for the podcasts, leaving out all of the data in the header section, we could iterate over only the outline nodes and print the text and xmlUrl attributes.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.getiterator('outline'):
        name = node.attrib.get('text')
        url = node.attrib.get('xmlUrl')
        if name and url:
            print ' %s :: %s' % (name, url)
        else:
            print name

Because we passed 'outline' to tree.getiterator(), processing is limited to nodes with the tag 'outline'.
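The tag-filtered loop described above can also be sketched for current Python 3, where getiterator() is no longer available and iter() plays the same role. This is only an illustration under stated assumptions: SAMPLE_OPML is a tiny inline stand-in for podcasts.opml, and list_feeds is a hypothetical helper name, neither of which comes from the original article.

```python
import io
from xml.etree import ElementTree

# Assumption: a small inline stand-in for podcasts.opml, not the author's file.
SAMPLE_OPML = """<opml version="1.0">
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"/>
    </outline>
  </body>
</opml>
"""


def list_feeds(opml_text):
    """Return one display line per outline node: group names, then their feeds."""
    tree = ElementTree.parse(io.StringIO(opml_text))
    lines = []
    # iter('outline') visits only elements whose tag is 'outline'.
    for node in tree.iter('outline'):
        name = node.attrib.get('text')
        url = node.attrib.get('xmlUrl')
        if name and url:
            lines.append(' %s :: %s' % (name, url))
        else:
            lines.append(name)
    return lines


for line in list_feeds(SAMPLE_OPML):
    print(line)
```

The listing that follows is the original script's output for the full podcasts.opml file.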
    $ python ElementTree_show_feed_urls.py
    Science and Tech
     APM: Future Tense :: http://www.publicradio.org/columns/futuretense/podcast.xml
     Engines Of Our Ingenuity Podcast :: http://www.npr.org/rss/podcast.php?id=510030
     Science & the City :: http://www.nyas.org/Podcasts/Atom.axd
    Books and Fiction
     Podiobooker :: http://feeds.feedburner.com/podiobooks
     The Drabblecast :: http://web.me.com/normsherman/Site/Podcast/rss.xml
     tor.com / category / tordotstories :: http://www.tor.com/rss/category/TorDotStories
    Computers and Programming
     MacBreak Weekly :: http://leo.am/podcasts/mbw
     FLOSS Weekly :: http://leo.am/podcasts/floss
     Core Intuition :: http://www.coreint.org/podcast.xml
    Python
     PyCon Podcast :: http://advocacy.python.org/podcasts/pycon.rss
     A Little Bit of Python :: http://advocacy.python.org/podcasts/littlebit.rss
     Django Dose Everything Feed :: http://djangodose.com/everything/feed/
    Miscelaneous
     dhellmann's CastSampler Feed :: http://www.castsampler.com/cast/feed/rss/dhellmann/

Finding Nodes in a Document

Walking the entire tree yourself like this, searching for relevant nodes, can be error prone. In the example above, we had to look at each outline node to determine if it was a group (nodes with only a “text” attribute) or podcast (with both “text” and “xmlUrl”). If we were writing a podcast downloader and needed to produce a simple list of the podcast feed URLs, without names or groups, we might simplify the logic using findall() to look for nodes with more descriptive search characteristics.

A first pass at converting the above example might construct an XPath argument to look for all outline nodes.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:
            print url

The logic in this version is not substantially different from the version using getiterator(). We still have to check for the presence of the URL, except that we don’t print the group name when the URL is not found.

    $ python ElementTree_find_feeds_by_tag.py
    http://www.publicradio.org/columns/futuretense/podcast.xml
    http://www.npr.org/rss/podcast.php?id=510030
    http://www.nyas.org/Podcasts/Atom.axd
    http://feeds.feedburner.com/podiobooks
    http://web.me.com/normsherman/Site/Podcast/rss.xml
    http://www.tor.com/rss/category/TorDotStories
    http://leo.am/podcasts/mbw
    http://leo.am/podcasts/floss
    http://www.coreint.org/podcast.xml
    http://advocacy.python.org/podcasts/pycon.rss
    http://advocacy.python.org/podcasts/littlebit.rss
    http://djangodose.com/everything/feed/
    http://www.castsampler.com/cast/feed/rss/dhellmann/

Another version can take advantage of the fact that we know the outline nodes are only nested two levels deep. If we change the search path to .//outline/outline we will process only the second level of outline nodes.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.findall('.//outline/outline'):
        url = node.attrib.get('xmlUrl')
        print url

We expect that all of the outline nodes nested 2 levels deep in the input will have the xmlUrl attribute referring to the podcast feed, so if we were brave we could skip checking for the attribute before using it.
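For readers who would rather not be brave, the structural search above can be combined with an explicit check. The following is a minimal Python 3 sketch under stated assumptions: SAMPLE_OPML is a small inline sample in the same shape as podcasts.opml, and feed_urls is a hypothetical helper name introduced for this example.

```python
from xml.etree import ElementTree

# Assumption: a small inline sample shaped like podcasts.opml, not the real file.
SAMPLE_OPML = """
<opml version="1.0">
  <body>
    <outline text="Science and Tech">
      <outline text="APM: Future Tense" type="rss"
               xmlUrl="http://www.publicradio.org/columns/futuretense/podcast.xml"/>
    </outline>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"/>
    </outline>
  </body>
</opml>
"""


def feed_urls(opml_text):
    """Collect xmlUrl values from second-level outline nodes only."""
    root = ElementTree.fromstring(opml_text)
    urls = []
    # './/outline/outline' matches outline elements whose parent is also an outline.
    for node in root.findall('.//outline/outline'):
        url = node.get('xmlUrl')
        # Guard against a second-level node that lacks the attribute.
        if url is not None:
            urls.append(url)
    return urls


for url in feed_urls(SAMPLE_OPML):
    print(url)
```

The listing that follows is the output of the original, unchecked script run against the full podcasts.opml file.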
    $ python ElementTree_find_feeds_by_structure.py
    http://www.publicradio.org/columns/futuretense/podcast.xml
    http://www.npr.org/rss/podcast.php?id=510030
    http://www.nyas.org/Podcasts/Atom.axd
    http://feeds.feedburner.com/podiobooks
    http://web.me.com/normsherman/Site/Podcast/rss.xml
    http://www.tor.com/rss/category/TorDotStories
    http://leo.am/podcasts/mbw
    http://leo.am/podcasts/floss
    http://www.coreint.org/podcast.xml
    http://advocacy.python.org/podcasts/pycon.rss
    http://advocacy.python.org/podcasts/littlebit.rss
    http://djangodose.com/everything/feed/
    http://www.castsampler.com/cast/feed/rss/dhellmann/

This version is limited to our existing structure, though, so if the outline nodes are ever rearranged into a deeper tree it will stop working.

Parsed Node Attributes

The items returned by findall() and getiterator() are Element objects, each representing a node in the XML parse tree. Each Element has attributes for accessing data pulled out of the XML. This can be illustrated with a somewhat more contrived example input file, data.xml:

    This child contains text.
    This child has regular text.And "tail" text.
    That & This

The “attributes” of a node are available in the attrib property, which acts like a dictionary.

    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    node = tree.find('./with_attributes')
    print node.tag
    for name, value in sorted(node.attrib.items()):
        print ' %-4s = "%s"' % (name, value)

The node on line 5 of the input file has 2 attributes, name and foo.

    $ python ElementTree_node_attributes.py
    with_attributes
     foo = "bar"
     name = "value"

The text content of the nodes is available, along with the “tail” text that comes after the end of a close tag.

    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for path in [ './child', './child_with_tail' ]:
        node = tree.find(path)
        print node.tag
        print ' child node text:', node.text
        print ' and tail text :', node.tail

The child node on line 3 contains embedded text, and the node on line 4 has text with a tail (including any whitespace).

    $ python ElementTree_node_text.py
    child
     child node text: This child contains text.
     and tail text :
    child_with_tail
     child node text: This child has regular text.
     and tail text : And "tail" text.

Conveniently, XML entity references embedded in the document are converted to the appropriate characters before values are returned.

    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    node = tree.find('entity_expansion')
    print node.tag
    print ' in attribute:', node.attrib['attribute']
    print ' in text :', node.text

The conversion saves you from having to worry about an implementation detail of representing certain characters in an XML document.

    $ python ElementTree_entity_references.py
    entity_expansion
     in attribute: This & That
     in text : That & This

Watching Events While Parsing

The other API useful for processing XML documents is event-based. The parser generates start events for opening tags and end events for closing tags. Iterating over the event stream lets you extract data from the document while parsing it, which is convenient if you don’t need to manipulate the entire document afterwards and if you want to avoid holding the entire parsed document in memory.

iterparse() returns an iterable that produces tuples containing the name of the event and the node triggering the event. Events can be one of:

start: A new tag has been encountered. The closing angle bracket of the tag was processed, but not the contents.
end: The closing angle bracket of a closing tag has been processed. All of the children were already processed.
start-ns: Start a namespace declaration.
end-ns: End a namespace declaration.

    from xml.etree.ElementTree import iterparse

    depth = 0
    prefix_width = 8
    prefix_dots = '.' * prefix_width
    line_template = '{prefix:<0.{prefix_len}}{event:<8}{suffix:<{suffix_len}} {node.tag:<12} {node_id}'

    for (event, node) in iterparse('podcasts.opml', ['start', 'end', 'start-ns', 'end-ns']):
        if event == 'end':
            depth -= 1

        prefix_len = depth * 2

        print line_template.format(prefix=prefix_dots,
                                   prefix_len=prefix_len,
                                   suffix='',
                                   suffix_len=(prefix_width - prefix_len),
                                   node=node,
                                   node_id=id(node),
                                   event=event,
                                   )

        if event == 'start':
            depth += 1

By default, only end events are generated. To see other events, pass the list of event names you want to receive to iterparse(), as in this example:

    $ python ElementTree_show_all_events.py
    start opml 876256
    ..start head 876336
    ....start title 888920
    ....end title 888920
    ....start dateCreated 889280
    ....end dateCreated 889280
    ....start dateModified 889320
    ....end dateModified 889320
    ..end head 876336
    ..start body 889400
    ....start outline 889560
    ......start outline 889600
    ......end outline 889600
    ......start outline 889480
    ......end outline 889480
    ......start outline 889680
    ......end outline 889680
    ....end outline 889560
    ....start outline 889720
    ......start outline 889760
    ......end outline 889760
    ......start outline 889840
    ......end outline 889840
    ......start outline 889920
    ......end outline 889920
    ....end outline 889720
    ....start outline 889880
    ......start outline 890040
    ......end outline 890040
    ......start outline 890120
    ......end outline 890120
    ......start outline 890200
    ......end outline 890200
    ....end outline 889880
    ....start outline 890240
    ......start outline 890360
    ......end outline 890360
    ......start outline 890440
    ......end outline 890440
    ......start outline 890520
    ......end outline 890520
    ....end outline 890240
    ....start outline 890640
    ......start outline 890720
    ......end outline 890720
    ....end outline 890640
    ..end body 889400
    end opml 876256

The event-style of processing may be more natural for some operations, such as converting XML input to some other format. For example, suppose we want to convert the list of podcasts we have been working with from an XML file to a data file we can load into a spreadsheet or database application. We don’t need to hold the entire data set in memory at a time, since we’re simply changing the format.

    import csv
    from xml.etree.ElementTree import iterparse
    import sys

    writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)

    group_name = ''

    for (event, node) in iterparse('podcasts.opml', events=['start']):
        if node.tag != 'outline':
            # Ignore anything not part of the outline
            continue
        if not node.attrib.get('xmlUrl'):
            # Remember the current group
            group_name = node.attrib['text']
        else:
            # Output a podcast entry
            writer.writerow( (group_name, node.attrib['text'],
                              node.attrib['xmlUrl'],
                              node.attrib.get('htmlUrl', ''),
                              )
                             )

This example program converts our podcast list to a CSV file, ready to be imported into another application.
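The same event-driven conversion can be sketched in self-contained Python 3 form. The assumptions are stated plainly: SAMPLE_OPML is a small inline sample rather than the real podcasts.opml, opml_to_csv is a hypothetical helper name, and the result is collected in a string instead of being written to sys.stdout so that it is easy to inspect.

```python
import csv
import io
from xml.etree.ElementTree import iterparse

# Assumption: a small inline sample shaped like podcasts.opml, not the real file.
SAMPLE_OPML = b"""<opml version="1.0">
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"
               htmlUrl="http://advocacy.python.org/podcasts/"/>
      <outline text="Django Dose Everything Feed" type="rss"
               xmlUrl="http://djangodose.com/everything/feed/"/>
    </outline>
  </body>
</opml>
"""


def opml_to_csv(opml_bytes):
    """Convert outline nodes to CSV rows while the document is being parsed."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n')
    group_name = ''
    # Attributes are already populated when the 'start' event fires.
    for event, node in iterparse(io.BytesIO(opml_bytes), events=('start',)):
        if node.tag != 'outline':
            continue  # ignore anything not part of the outline
        if not node.attrib.get('xmlUrl'):
            group_name = node.attrib['text']  # remember the current group
        else:
            writer.writerow((group_name,
                             node.attrib['text'],
                             node.attrib['xmlUrl'],
                             node.attrib.get('htmlUrl', '')))
    return out.getvalue()


print(opml_to_csv(SAMPLE_OPML), end='')
```

The listing that follows is the original script's output for the full podcasts.opml file.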
    $ python ElementTree_write_podcast_csv.py
    "Science and Tech","APM: Future Tense","http://www.publicradio.org/columns/futuretense/podcast.xml","http://www.publicradio.org/columns/futuretense/"
    "Science and Tech","Engines Of Our Ingenuity Podcast","http://www.npr.org/rss/podcast.php?id=510030","http://www.uh.edu/engines/engines.htm"
    "Science and Tech","Science & the City","http://www.nyas.org/Podcasts/Atom.axd","http://www.nyas.org/WhatWeDo/SciencetheCity.aspx"
    "Books and Fiction","Podiobooker","http://feeds.feedburner.com/podiobooks","http://www.podiobooks.com/blog"
    "Books and Fiction","The Drabblecast","http://web.me.com/normsherman/Site/Podcast/rss.xml","http://web.me.com/normsherman/Site/Podcast/Podcast.html"
    "Books and Fiction","tor.com / category / tordotstories","http://www.tor.com/rss/category/TorDotStories","http://www.tor.com/"
    "Computers and Programming","MacBreak Weekly","http://leo.am/podcasts/mbw","http://twit.tv/mbw"
    "Computers and Programming","FLOSS Weekly","http://leo.am/podcasts/floss","http://twit.tv"
    "Computers and Programming","Core Intuition","http://www.coreint.org/podcast.xml","http://www.coreint.org/"
    "Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.rss","http://advocacy.python.org/podcasts/"
    "Python","A Little Bit of Python","http://advocacy.python.org/podcasts/littlebit.rss","http://advocacy.python.org/podcasts/"
    "Python","Django Dose Everything Feed","http://djangodose.com/everything/feed/",""
    "Miscelaneous","dhellmann's CastSampler Feed","http://www.castsampler.com/cast/feed/rss/dhellmann/","http://www.castsampler.com/users/dhellmann/"

Creating Your Own Tree Builder

A potentially more efficient means of handling parse events is to replace the standard tree builder behavior with your own. The ElementTree parser uses an XMLTreeBuilder to process the XML and call methods on a target class to save the results. The usual output is an ElementTree instance created by the default TreeBuilder class. By replacing TreeBuilder with your own class, you can receive the events before the Element nodes are instantiated, saving that portion of the overhead.

The XML-to-CSV app from the previous section can be translated to a tree builder.

    import csv
    from xml.etree.ElementTree import XMLTreeBuilder
    import sys

    class PodcastListToCSV(object):

        def __init__(self, outputFile):
            self.writer = csv.writer(outputFile, quoting=csv.QUOTE_NONNUMERIC)
            self.group_name = ''
            return

        def start(self, tag, attrib):
            if tag != 'outline':
                # Ignore anything not part of the outline
                return
            if not attrib.get('xmlUrl'):
                # Remember the current group
                self.group_name = attrib['text']
            else:
                # Output a podcast entry
                self.writer.writerow( (self.group_name, attrib['text'],
                                       attrib['xmlUrl'],
                                       attrib.get('htmlUrl', ''),
                                       )
                                      )

        def end(self, tag):
            # Ignore closing tags
            pass

        def data(self, data):
            # Ignore data inside nodes
            pass

        def close(self):
            # Nothing special to do here
            return

    target = PodcastListToCSV(sys.stdout)
    parser = XMLTreeBuilder(target=target)
    with open('podcasts.opml', 'rt') as f:
        for line in f:
            parser.feed(line)
    parser.close()

PodcastListToCSV implements the TreeBuilder protocol. Each time a new XML tag is encountered, start() is called with the tag name and attributes. When a closing tag is seen, end() is called with the name. In between, data() is called when a node has content (the tree builder is expected to keep up with the “current” node). When all of the input is processed, close() is called. It can return a value, which will be returned to the user of the XMLTreeBuilder.
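The target protocol described above can be sketched for current Python 3, where the parser class is spelled XMLParser and still accepts a target argument. This is a minimal illustration, not the article's program: FeedCollector and SAMPLE_OPML are hypothetical names introduced here, and the target returns its rows from close() instead of writing CSV, to show the return value being passed back to the caller.

```python
from xml.etree.ElementTree import XMLParser

# Assumption: a small inline sample shaped like podcasts.opml, not the real file.
SAMPLE_OPML = """<opml version="1.0">
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" type="rss"
               xmlUrl="http://advocacy.python.org/podcasts/pycon.rss"/>
    </outline>
  </body>
</opml>
"""


class FeedCollector:
    """Hypothetical parser target that collects (group, name, url) tuples."""

    def __init__(self):
        self.group_name = ''
        self.rows = []

    def start(self, tag, attrib):
        if tag != 'outline':
            return  # ignore anything not part of the outline
        if not attrib.get('xmlUrl'):
            self.group_name = attrib['text']  # remember the current group
        else:
            self.rows.append((self.group_name, attrib['text'], attrib['xmlUrl']))

    def end(self, tag):
        pass  # closing tags carry no information we need

    def data(self, data):
        pass  # text inside nodes is ignored

    def close(self):
        # Whatever is returned here is handed back by parser.close().
        return self.rows


parser = XMLParser(target=FeedCollector())
parser.feed(SAMPLE_OPML)
rows = parser.close()
print(rows)
```

Because no Element objects are built, the only state kept is what the target chooses to store. The listing that follows is the original script's output for the full podcasts.opml file.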
    $ python ElementTree_podcast_csv_treebuilder.py
    "Science and Tech","APM: Future Tense","http://www.publicradio.org/columns/futuretense/podcast.xml","http://www.publicradio.org/columns/futuretense/"
    "Science and Tech","Engines Of Our Ingenuity Podcast","http://www.npr.org/rss/podcast.php?id=510030","http://www.uh.edu/engines/engines.htm"
    "Science and Tech","Science & the City","http://www.nyas.org/Podcasts/Atom.axd","http://www.nyas.org/WhatWeDo/SciencetheCity.aspx"
    "Books and Fiction","Podiobooker","http://feeds.feedburner.com/podiobooks","http://www.podiobooks.com/blog"
    "Books and Fiction","The Drabblecast","http://web.me.com/normsherman/Site/Podcast/rss.xml","http://web.me.com/normsherman/Site/Podcast/Podcast.html"
    "Books and Fiction","tor.com / category / tordotstories","http://www.tor.com/rss/category/TorDotStories","http://www.tor.com/"
    "Computers and Programming","MacBreak Weekly","http://leo.am/podcasts/mbw","http://twit.tv/mbw"
    "Computers and Programming","FLOSS Weekly","http://leo.am/podcasts/floss","http://twit.tv"
    "Computers and Programming","Core Intuition","http://www.coreint.org/podcast.xml","http://www.coreint.org/"
    "Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.rss","http://advocacy.python.org/podcasts/"
    "Python","A Little Bit of Python","http://advocacy.python.org/podcasts/littlebit.rss","http://advocacy.python.org/podcasts/"
    "Python","Django Dose Everything Feed","http://djangodose.com/everything/feed/",""
    "Miscelaneous","dhellmann's CastSampler Feed","http://www.castsampler.com/cast/feed/rss/dhellmann/","http://www.castsampler.com/users/dhellmann/"

Parsing Strings

To work with smaller bits of XML text, especially string literals as might be embedded in the source of a program, use xml.etree.ElementTree.XML and pass a single argument, the string containing the XML to be parsed.

    from xml.etree.ElementTree import XML

    parsed = XML('''
    <root>
      <group>
        <child id="a">This is child "a".</child>
        <child id="b">This is child "b".</child>
      </group>
      <group>
        <child id="c">This is child "c".</child>
      </group>
    </root>
    ''')

    print 'parsed =', parsed

    for elem in parsed.getiterator():
        print elem.tag
        if elem.text is not None and elem.text.strip():
            print ' text: "%s"' % elem.text
        if elem.tail is not None and elem.tail.strip():
            print ' tail: "%s"' % elem.tail
        for name, value in sorted(elem.attrib.items()):
            print ' %-4s = "%s"' % (name, value)
        print

Notice that unlike with parse(), the return value is an Element instance instead of an ElementTree.

    $ python ElementTree_XML.py
    parsed =
    root
    group
    child
     text: "This is child "a"."
     id = "a"
    child
     text: "This is child "b"."
     id = "b"
    group
    child
     text: "This is child "c"."
     id = "c"

For structured XML that uses the “id” attribute to identify unique nodes of interest, XMLID() is a convenient way to access the parse results.

    from xml.etree.ElementTree import XMLID

    tree, id_map = XMLID('''
    <root>
      <group>
        <child id="a">This is child "a".</child>
        <child id="b">This is child "b".</child>
      </group>
      <group>
        <child id="c">This is child "c".</child>
      </group>
    </root>
    ''')

    for key, value in sorted(id_map.items()):
        print '%s = %s' % (key, value)

XMLID() returns the parsed tree as an Element object, along with a dictionary mapping the id attribute strings to the individual nodes in the tree.

    $ python ElementTree_XMLID.py
    a =
    b =
    c =

See also

Outline Processor Markup Language, OPML: Dave Winer’s OPML specification and documentation.
XPath Support in ElementTree: Part of Fredrik Lundh’s original documentation for ElementTree.
csv: Read and write comma-separated-value files.
PyMOTW Home
The canonical version of this article

python PyMOTW https://pinboard.in/u:rahuldave/b:e457a23f8452/

Cache Machine: Automatic caching for your Django models
2010-03-11T19:35:32+00:00 http://simonwillison.net/2010/Mar/11/cachemachine/ rahuldave cachemachine caching django memcached mozilla orm ormcaching python redis https://pinboard.in/u:rahuldave/b:f6a05c5e2b20/