Imagine the following situation:

  • You need to import hotels all over the world from a SOAP API endpoint.
  • The list contains hundreds of thousands of hotels.
  • The endpoint doesn't have pagination.
  • The response is something like this:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetHotelsResponse>
      <GetHotelsResult>
        <Hotels>
          <Hotel>
            <Id>1</Id>
            <Name>Hotel A</Name>
            <Latitude>41.390205</Latitude>
            <Longitude>2.6466666667</Longitude>
          </Hotel>
          <!-- Hundreds of thousands ... -->
          <Hotel>
            <Id>351987</Id>
            <Name>Hotel Z</Name>
            <Latitude>40.416775</Latitude>
            <Longitude>2.154007</Longitude>
          </Hotel>
        </Hotels>
      </GetHotelsResult>
    </GetHotelsResponse>
  </soap:Body>
</soap:Envelope>

(Yes, this still happens in the real world :( )

You would probably use the suds library to consume the SOAP API, and a first attempt to import the hotels might look like this:

from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

hotels = client.service.GetHotels()
for hotel in hotels:
    import_hotel(hotel)

This snippet would work most of the time, but it has a problem in our scenario: suds parses the response and builds all the Python objects in memory at once. Given that the list of hotels is so huge, this snippet would consume all your available memory, or at least more than the 32GB we have on our server.

Unfortunately, suds does not appear to have a way to parse SOAP responses in an iterative way, but it does have a way to get the raw XML response without parsing it into Python objects. Taking advantage of the lxml library we can build an iterative version:

import cStringIO
from lxml import etree
from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

# Get the raw XML response without parsing SOAP elements to Python objects
client.set_options(retxml=True)
xml = cStringIO.StringIO(client.service.GetHotels())

# Iterate the XML hotel per hotel
context = etree.iterparse(xml, events=('end',), tag='Hotel')
for _event, xml_hotel in context:
    import_hotel(xml_hotel)

Now we are parsing and importing hotels one by one without consuming all the memory, right? Well, not really. If you run this code you'll see that it starts importing hotels very quickly, but soon slows to a crawl. Why? Although etree.iterparse does not load the entire XML up front, it never frees the references to the nodes it has already produced. We need to manually free two kinds of references:

  • References to child nodes that have already been processed.
  • References to preceding siblings of the current node, which are implicitly kept alive through the root node.
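We can see this accumulation with a short, self-contained sketch (Python 3, using io.BytesIO in place of cStringIO; the hotel data here is synthesized, not from the real API):

```python
import io
from lxml import etree

# A small stand-in for the huge SOAP response body.
xml_bytes = b"<Hotels>" + b"".join(
    b"<Hotel><Id>%d</Id></Hotel>" % i for i in range(1000)
) + b"</Hotels>"

root = None
for _event, elem in etree.iterparse(io.BytesIO(xml_bytes),
                                    events=("end",), tag="Hotel"):
    root = elem.getparent()

# Every parsed <Hotel> is still attached to the root: nothing was freed.
print(len(root))  # → 1000
```

All 1000 elements are still referenced from the root after the loop, which is exactly why memory keeps growing. The final version below frees both kinds of references inside the loop: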
import cStringIO
from lxml import etree
from suds.client import Client

wsdl = 'https://api.soap.com/v1?wsdl'
client = Client(wsdl)

# Get the raw XML response without parsing SOAP elements to Python objects
client.set_options(retxml=True)
xml = cStringIO.StringIO(client.service.GetHotels())

# Iterate the XML hotel per hotel
context = etree.iterparse(xml, events=('end',), tag='Hotel')
for _event, xml_hotel in context:
    import_hotel(xml_hotel)

    # Free memory
    xml_hotel.clear()  # free children
    while xml_hotel.getprevious() is not None:  # free preceding siblings
        del xml_hotel.getparent()[0]

Finally, we can import the entire list of hotels without running out of memory. :)
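The clearing logic can also be wrapped in a reusable generator so the import code stays clean. A minimal sketch (Python 3, io.BytesIO in place of cStringIO; iter_tag and the sample data are hypothetical names, not part of suds or lxml):

```python
import io
from lxml import etree

def iter_tag(fileobj, tag):
    """Yield each <tag> element, freeing processed nodes as we go."""
    for _event, elem in etree.iterparse(fileobj, events=("end",), tag=tag):
        yield elem
        # Free the children of the element we just yielded ...
        elem.clear()
        # ... and any preceding siblings still referenced from the root.
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Hypothetical sample data standing in for the SOAP response body.
xml_bytes = b"<Hotels>" + b"".join(
    b"<Hotel><Id>%d</Id></Hotel>" % i for i in range(1000)
) + b"</Hotels>"

ids = [hotel.findtext("Id") for hotel in iter_tag(io.BytesIO(xml_bytes), "Hotel")]
print(len(ids))  # → 1000
```

Each element is consumed before the generator resumes and clears it, so memory stays flat no matter how large the document is.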

You can find more details about iterative XML parsing here: https://www.ibm.com/developerworks/xml/library/x-hiperfparse