Abu Ashraf Masnun

Python

Extracting links and their page title from your Twitter Archive

Post author By masnun
Post date December 18, 2015

Twitter allows us to download our Tweets from the account settings page. Once we request our archive, Twitter will take some time to prepare it and send us an email once this is ready. We will get a download link in the email. After unpacking the archive, we shall find a csv file that contains our tweets – tweets.csv. The archive also contains a html page (index.html) that displays our tweets on a nice UI. While this is nice to look at, our primary objective is to extract the links from our tweets.

If we look at the CSV file closely, we shall find a field named expanded_urls which generally contains the urls we use in our tweets. We will work with the values in this field. With the url, we also want to fetch their title. For this we will use Python 3 (I am using 3.5) and we need the requests and beautifulsoup4 packages to download and parse the pages. Let’s install them:

pip install requests
pip install beautifulsoup4

1 2	pip install requests pip install beautifulsoup4

We will follow these steps to extract links and their page titles from the tweets:

Open the csv file and read row by row
Each row contains a tweet, we take the expanded_urls field
This field can contain multiple urls, separated by a comma. We need to iterate over them all
We will skip some domains, for example, we don’t want to visit links to twitter status updates
We fetch the html content using the requests library. If the page doesn’t return a HTTP 200, we ignore the response
We extract the title using beautiful soup and display it

Now let’s convert these steps to codes. Here’s the final script I came up with:

import csv
import requests
from bs4 import BeautifulSoup

DOMAINS_TO_SKIP = ['twitter.com']

with open('tweets.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)

    # each row is a tweet
    for row in reader:
        url_string = row.get('expanded_urls')
        urls = url_string.split(",")
        for url in urls:

            # Skip the domains
            skip = False
            for domain in DOMAINS_TO_SKIP:
                if domain in url:
                    skip = True
                    break

            # fetch the title
            if url and not skip:
                print("Crawling: {}".format(url))
                resp = requests.get(url)
                if resp.status_code == 200:
                    soup = BeautifulSoup(resp.content, "html.parser")
                    if soup.title:
                        print("Title: {}".format(soup.title.string))

import csv

import requests

from bs4 import BeautifulSoup

DOMAINS_TO_SKIP = ['twitter.com']

with open('tweets.csv', 'r') as csvfile:

reader = csv.DictReader(csvfile)

# each row is a tweet

for row in reader:

url_string = row.get('expanded_urls')

urls = url_string.split(",")

for url in urls:

# Skip the domains

skip = False

for domain in DOMAINS_TO_SKIP:

if domain in url:

skip = True

break

# fetch the title

if url and not skip:

print("Crawling: {}".format(url))

resp = requests.get(url)

if resp.status_code == 200:

soup = BeautifulSoup(resp.content, "html.parser")

if soup.title:

print("Title: {}".format(soup.title.string))

I am actually using this for a personal project I am doing here – https://github.com/masnun/bookmarks – it’s basically a bare bone django admin app where I intend to store the links I visit/share. I come across a lot of interesting projects, articles, videos and then later lose track of them. Hope this app will remedy that. This piece of code is part of a twitter import functionality of the mentioned app.

Uncategorized

Top 500 StackOverflow contributors from Bangladesh

Post author By masnun
Post date December 12, 2015

Update: The result returned from the StackExchange is slightly outdated. So it might not display the latest reputation or other profile changes and thus slightly affecting the ranking.

Update: Because the large list was affecting the site performance, I have moved it to Github Gist.

This post uses the StackExchange Data Explorer to query the StackOverflow users and grab their data. The Python script to query and parse the data is attached below the ranking. So let’s wait no further and meet the top 500 people on SO from Bangladesh:

View List on Gist

Script:

import requests
import csv

CSV_URL = "http://data.stackexchange.com/stackoverflow/csv/521550"

response = requests.get(CSV_URL)
data = csv.DictReader(response.content.splitlines())

for rank, row in enumerate(data):
    print "#{} - {} - {} - {}".format((rank+1),row['DisplayName'], row['Reputation'], row['WebsiteUrl'])
    
    """
    print "<li><a target='_blank' href='http://stackoverflow.com/users/{}'>{}</a> ({}) </li>".format(
            row['User Link'],
            row['DisplayName'],
            row['Reputation']
        )
    """

import requests

import csv

CSV_URL = "http://data.stackexchange.com/stackoverflow/csv/521550"

response = requests.get(CSV_URL)

data = csv.DictReader(response.content.splitlines())

for rank, row in enumerate(data):

print "#{} - {} - {} - {}".format((rank+1),row['DisplayName'], row['Reputation'], row['WebsiteUrl'])

"""

print "<li><a target='_blank' href='http://stackoverflow.com/users/{}'>{}</a> ({}) </li>".format(

row['User Link'],

row['DisplayName'],

row['Reputation']

)

"""

Python

Python 3: Using blocking functions or codes with asyncio

Post author By masnun
Post date December 7, 2015

We know we can do a lot of async stuff with asyncio but have you ever wondered how to execute blocking codes with it? It’s pretty simple actually, asyncio allows us to run blocking code using BaseEventLoop.run_in_executor method. It will run our functions in parallel and provide us with Future objects which we can await or yield from.

Let’s see an example with the popular requests library:

import asyncio
import requests

loop = asyncio.get_event_loop()

def get_html(url):
	return loop.run_in_executor(None, requests.get, url)

@asyncio.coroutine
def main():
	resp1 = yield from get_html("https://masnun.com")
	resp2 = yield from get_html("http://python.org")
	
	print(resp2, resp1)



loop.run_until_complete(main())

import asyncio

import requests

loop = asyncio.get_event_loop()

def get_html(url):

return loop.run_in_executor(None, requests.get, url)

@asyncio.coroutine

def main():

resp1 = yield from get_html("https://masnun.com")

resp2 = yield from get_html("http://python.org")

print(resp2, resp1)

loop.run_until_complete(main())

If you run the code snippet, you can see how the two responses are fetched asynchronously 🙂

Recent Posts

Recent Comments

Archives

Categories