Categories
Python

Scrapy: Scraping each type of Item to its own collection in MongoDB

I am using Scrapy and I have two different Items. I want to store entries for each specific item in its own MongoDB collection. For example, let’s assume this is what I have in the items.py file:

I want to store Student items in a student collection and Course items in a course collection. How do we do that?

If you have used Scrapy before, you already know that for storing data, we use Pipelines. Here’s our own MongoPipeline that stores items to their own collection:
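A minimal sketch of such a pipeline could look like the following. The setting names MONGO_URI and MONGO_DATABASE and the fallback database name "items" are assumptions made for illustration, and PyMongo is imported lazily inside open_spider so the module itself loads without it.

```python
class MongoPipeline:
    """Store every item in a collection named after its lowercased class."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DATABASE are assumed names for our own settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
        )

    def open_spider(self, spider):
        import pymongo  # lazy import: only needed once a spider starts

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Student -> "student", Course -> "course"
        collection_name = type(item).__name__.lower()
        # dict(item) gives a plain dictionary that PyMongo can store.
        self.db[collection_name].insert_one(dict(item))
        return item
```

To try it, the class would be listed in ITEM_PIPELINES in settings.py under whatever module path your project uses.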

So this is what’s happening:

  • We’re using PyMongo as the MongoDB driver
  • The MongoDB-related configuration lives in the settings. We read it and construct a MongoDB client, and we select the database based on a setting
  • In the process_item function, we get the type of the item and lowercase its name. This type name serves as the MongoDB collection name for us.
  • We insert the item. We call dict() on the item to get a dictionary representation, which we can save directly using PyMongo.

That’s it. Now if you run your spiders, items of each type will go to their own collection in MongoDB.

Categories
Mac Python

Fixing fatal error: ‘openssl/aes.h’ file not found on OS X

OS X recently started using its own library instead of OpenSSL, so new installations of anything that depends on OpenSSL might fail. If you already have OpenSSL installed on your system, for example through Homebrew, you just need to point the compiler and linker at that copy (typically through the CFLAGS and LDFLAGS environment variables) while building your program.

For example, if you’re trying to install the popular cryptography package from PyPI, you can do this:

The cryptography package is a common dependency of many other packages, Scrapy for example. So if you run into an issue like this, try installing that single dependency first and then the package that depends on it. In this case, first use the above command to install the cryptography package and then install Scrapy.

Categories
Python

Python asyncio: Future, Task and the Event Loop

Event Loop

On any platform, when we want to do something asynchronously, it usually involves an event loop. An event loop is a loop that can register tasks to be executed, execute them, delay or even cancel them, and handle different events related to these operations. Generally, we schedule multiple async functions on the event loop. The loop runs one function; when that function waits for IO, the loop pauses it and runs another. When the first function's IO completes, it is resumed. Thus two or more functions can run together co-operatively. This is the main goal of an event loop.
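Here is a small sketch of that co-operative switching. The worker names and delays are made up for illustration, and asyncio.sleep stands in for real IO.

```python
import asyncio

events = []


async def worker(name, delay):
    events.append(f"{name} start")
    await asyncio.sleep(delay)  # simulated IO: the loop runs something else meanwhile
    events.append(f"{name} end")


async def main():
    # Both workers are scheduled on the same loop and take turns at each await.
    await asyncio.gather(worker("A", 0.2), worker("B", 0.1))


asyncio.run(main())
print(events)
```

Worker B starts while A is still waiting, which is the interleaving described above.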

The event loop can also pass resource-intensive functions to a thread pool for processing. The internals of the event loop are quite complex and we don’t need to worry much about them right away. We just need to remember that the event loop is the mechanism through which we can schedule our async functions and get them executed.
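As a rough sketch of the thread pool hand-off, loop.run_in_executor can run a blocking function on the loop's default executor. The slow_checksum function and its input are hypothetical stand-ins for any blocking work.

```python
import asyncio
import hashlib


def slow_checksum(data, rounds=50_000):
    """Blocking work (hypothetical example): hash the data repeatedly."""
    digest = data
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


async def main():
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, slow_checksum, b"payload")


checksum = asyncio.run(main())
print(checksum)
```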

Futures / Tasks

If you are into JavaScript too, you probably know about Promise. In Python we have similar concepts: Future and Task. A Future is an object that is supposed to have a result in the future. A Task is a subclass of Future that wraps a coroutine. When the coroutine finishes, the result of the Task is realized.
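To make the difference concrete, here is a minimal sketch: a bare Future gets its result set by hand, while a Task gets its result from the coroutine it wraps. The compute coroutine and the values are made up for the example.

```python
import asyncio


async def compute():
    await asyncio.sleep(0.01)
    return 42


async def main():
    loop = asyncio.get_running_loop()

    # A bare Future: someone else has to fill in the result later.
    future = loop.create_future()
    loop.call_later(0.01, future.set_result, "done")

    # A Task: a Future whose result comes from running a coroutine.
    task = asyncio.create_task(compute())
    assert isinstance(task, asyncio.Future)

    return await future, await task


results = asyncio.run(main())
print(results)
```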

Coroutines

We discussed coroutines in our last blog post. A coroutine is a function that can pause itself and hand back a series of values over time. It pauses its execution by using yield, yield from, or await (Python 3.5+) in an expression. The function stays paused until that expression actually receives a value.
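A tiny generator-based sketch shows the pause-and-resume behaviour. The running_total function is a made-up example: it pauses at every yield and continues only when a value is sent in.

```python
def running_total():
    total = 0
    while True:
        # Pause here, hand back the current total, wait for the next value.
        value = yield total
        total += value


gen = running_total()
first = next(gen)       # run up to the first yield
second = gen.send(5)    # resume with 5, pause again at the next yield
third = gen.send(10)    # resume with 10
print(first, second, third)
```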

Fitting Event Loop and Future/Task Together

It’s simple. We need an event loop and we need to register our future/task objects with the event loop. The loop will schedule and run them. We can add callbacks to our future/task objects so that we are notified when a future has its result.

Very often we choose to use coroutines for our work. We wrap a coroutine in a Task, which is a kind of Future. When the coroutine yields, it is paused. When the value it was waiting for is available, it is resumed. When it returns, the Task is complete and holds the return value as its result. Any associated callbacks are run. If the coroutine raises an exception, the Task fails instead of resolving to a result: the exception is stored on the Task and raised again when the result is requested.
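The failure case can be sketched like this. The failing coroutine and its error message are invented for the example. The callback still runs, and the stored exception surfaces when the task is awaited.

```python
import asyncio


async def failing():
    await asyncio.sleep(0.01)
    raise ValueError("boom")


async def main():
    outcome = {}
    task = asyncio.create_task(failing())
    # Done callbacks fire on failure too; the exception is stored on the task.
    task.add_done_callback(
        lambda t: outcome.update(callback=type(t.exception()).__name__)
    )
    try:
        await task
    except ValueError as exc:
        outcome["awaiter"] = str(exc)
    await asyncio.sleep(0)  # give any pending callback a chance to run
    return outcome


outcome = asyncio.run(main())
print(outcome)
```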

So let’s move ahead and look at some example code.
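A minimal sketch of such an example follows. It uses async def where older code used the @asyncio.coroutine decorator (removed in Python 3.11). The names slow_operation and got_result come from the discussion here, while the return value and the received list are assumptions for illustration.

```python
import asyncio

received = []


async def slow_operation():
    await asyncio.sleep(0.1)  # pretend this is slow IO
    return "Future is done!"


def got_result(future):
    # Called by the loop once the task has a result.
    received.append(future.result())


loop = asyncio.new_event_loop()
try:
    task = loop.create_task(slow_operation())
    task.add_done_callback(got_result)
    loop.run_until_complete(task)
finally:
    loop.close()

print(received)
```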

As you can see already:

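  • @asyncio.coroutine declares it as a coroutine (this decorator was deprecated in Python 3.8 and removed in 3.11; on Python 3.5+ an async def function serves the same purpose)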
  • loop.create_task(slow_operation()) creates a task from the coroutine returned by slow_operation()
  • task.add_done_callback(got_result) adds a callback to our task
  • loop.run_until_complete(task) runs the event loop until the task is realized. As soon as it has a value, the loop stops

The run_until_complete function is a nice way to manage the loop. Of course we could do this:
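A sketch of that alternative, under the same assumptions as before (async def instead of the old decorator, and a made-up return value):

```python
import asyncio

received = []
loop = asyncio.new_event_loop()


async def slow_operation():
    await asyncio.sleep(0.1)
    return "Future is done!"


def got_result(future):
    received.append(future.result())
    loop.stop()  # without this, run_forever() would never return


task = loop.create_task(slow_operation())
task.add_done_callback(got_result)
try:
    loop.run_forever()
finally:
    loop.close()

print(received)
```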

Here we make the loop run forever and from our callback, we explicitly shut it down when the future has resolved.