OpenBlock

A hyperlocal Django project

PyGotham 2011

http://openblockproject.org

Paul Winkler

OpenPlans

Warning, there is not much code in this talk.

What?

An open source (GPL) platform for hyperlocal news. Version 1.0.1.

Who?

Why?

Our mandate: Make it easier to get running.

Before...

vanilla_orig_timeline_600.png

... After

phil_timeline_600.png

Pitch

You really really want to:

Nifty Features

What's Local News?

Anything with a location and a date. e.g.:

Boston demo walkthrough

http://demo.openblockproject.org

Top down from home page

Search:

Map

Location type list

"Stay up to date"

News type list

Widgets

But we're not at PyBoston!

Let's make one for NYC from scratch.

Well, almost.

http://openblockproject.org/docs/install

dropdb openblock_nycblock; createdb -T template_postgis openblock_nycblock
django-admin.py syncdb --migrate
django-admin.py process_tasks &
django-admin.py runserver

The settings file

from ebpub.settings_default import *
SHORT_NAME = 'new york'
DEFAULT_MAP_CENTER_LON = -73.949776
DEFAULT_MAP_CENTER_LAT = 40.741014
DEFAULT_MAP_ZOOM = 10

METRO_LIST = ({
        'extent': (-74.259567, 40.493959, -73.766384, 40.888601),
        'multiple_cities': True,
        'city_name': 'New York',
        # The SHORT_NAME in the settings file.
        'short_name': SHORT_NAME,
        'metro_name': 'New York',
        'state': 'NY',
        'state_name': 'New York',
        'time_zone': TIME_ZONE,
        'city_location_type': 'boroughs',
    },)  # Trailing comma: a tuple of dicts, not a bare dict.
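
A quick sanity check for these settings, as a sketch: the default map center should fall inside the metro extent. The helper name center_in_extent is made up for this example and is not part of OpenBlock. It assumes the extent is ordered (min lon, min lat, max lon, max lat), which is consistent with the values shown.

```python
# Hypothetical helper, not part of OpenBlock.
def center_in_extent(lon, lat, extent):
    """Return True if (lon, lat) lies inside the bounding box.

    Assumes extent is (min_lon, min_lat, max_lon, max_lat).
    """
    min_lon, min_lat, max_lon, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Values copied from the settings file.
NYC_EXTENT = (-74.259567, 40.493959, -73.766384, 40.888601)
print(center_in_extent(-73.949776, 40.741014, NYC_EXTENT))
```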

Setup: Loading Boroughs

http://localhost:8000/admin/db/locationtype/add/

http://localhost:8000/admin/db/location/upload-shapefile/

Load boroughs first: the block import can then use them, so each block knows which borough it is in.

Create a Boroughs location type. Then upload http://www.nyc.gov/html/dcp/download/bytes/nybb_10cav.zip

Setup: Loading zip codes

http://localhost:8000/admin/db/location/import-zip-shapefiles/

Paste in these:

10001 10002 10003 10005 10006 10007 10008 10009 10010 10012 10013 10014 10016 10017 10018 10019 10020 10021 10022 10023 10024 10025 10027 10028 10029 10030 10031 10032 10033 10034 10035 10036 10037 10038 10039 10040 10041 10055
Known bug: the zip code import should replace existing zip codes instead of failing.

Setup: Loading Neighborhoods

http://localhost:8000/admin/db/locationtype/add/

http://localhost:8000/admin/db/location/upload-shapefile/

Create a Neighborhood location type. Be sure to select it before the block import!

Zillow neighborhood data: ugh. It works, but it's pretty bad: things are missing or in vastly wrong places.

Setup: Loading blocks

Just Manhattan for now. This one takes ~ 8 minutes.

http://localhost:8000/admin/streets/block/import-blocks/

BASEURL is http://tigerline.census.gov/geo/tiger/TIGER2009/36_NEW_YORK/

Manhattan:

http://tigerline.census.gov/geo/tiger/TIGER2009/36_NEW_YORK/36061_New_York_County/tl_2009_36061_featnames.zip
http://tigerline.census.gov/geo/tiger/TIGER2009/36_NEW_YORK/36061_New_York_County/tl_2009_36061_faces.zip
http://tigerline.census.gov/geo/tiger/TIGER2009/36_NEW_YORK/36061_New_York_County/tl_2009_36061_edges.zip
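
The three Manhattan URLs share one naming pattern, so they can be generated rather than pasted. A minimal sketch: tiger_urls is a made-up helper, and the idea that other counties follow the same pattern is an assumption to check against the TIGER directory listing.

```python
# Hypothetical helper, not part of OpenBlock.
BASEURL = "http://tigerline.census.gov/geo/tiger/TIGER2009/36_NEW_YORK/"


def tiger_urls(county_fips, county_dir, baseurl=BASEURL):
    """Build the featnames, faces and edges zip URLs for one county."""
    return [
        "%s%s_%s/tl_2009_%s_%s.zip"
        % (baseurl, county_fips, county_dir, county_fips, kind)
        for kind in ("featnames", "faces", "edges")
    ]


# Manhattan is New York County, FIPS code 36061.
for url in tiger_urls("36061", "New_York_County"):
    print(url)
```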

Where's the news?

Let's drop in some ready-made stuff: Meetups, Flickr photos, Open311 / SeeClickFix issues.

These are data sources that I know serve NYC, and that OpenBlock has generic scraper scripts for.

First load a fixture that configures a news type to store each of these.

Blocks MUST be loaded before we can scrape.

Then we can scrape them:

Flickr

django-admin.py loaddata \
ebdata/ebdata/scrapers/general/flickr/photos_schema.json

python ebdata/ebdata/scrapers/general/flickr/flickr_retrieval.py
update_aggregates

SeeClickFix issues

Uses Open311 API:

django-admin.py loaddata \
ebdata/ebdata/scrapers/general/open311/open311_service_requests_schema.json

python ebdata/ebdata/scrapers/general/open311/georeportv2.py \
--days-prior=10 \
--html-url-template=http://seeclickfix.com/issues/{id} \
http://seeclickfix.com/new-york/open311/v2

update_aggregates

Meetups

There are lots of these:

django-admin.py loaddata \
ebdata/ebdata/scrapers/general/meetup/meetup_schema.json

python ebdata/ebdata/scrapers/general/meetup/meetup_retrieval.py
update_aggregates

REST API: Searching News

curl "http://localhost:8000/api/dev1/items.json?\
locationid=neighborhoods/midtown&limit=2"

Result is GeoJSON:

{"type": "FeatureCollection", "features": [
{"geometry": {"type": "Point",
  "coordinates": [
   -73.991821000000002, 40.768695999999998]
 },
 "type": "Feature",
 "properties": {
  "location_name": "Btw 10th and 11th Ave at 52nd and 54th, New York, NY, 10019",
  "venue_name": "DeWitt Clinton Park Dog Run",
  "start_time": "11:30:00-05:00",
  "title": "Hells Kitchen Pug Meetup",
  "group_name": "The New York City Pug Meetup Group",
  "url": "http://www.meetup.com/NYCPugs/events/30986161/",
  "venue_phone": "",
  "item_date": "2011-09-18",
  ...
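
Since the result is plain GeoJSON, reading it needs nothing beyond the standard library. A sketch that parses a trimmed copy of the response above; summarize is a made-up name, and only fields visible in the sample are used.

```python
import json

# Trimmed copy of the response shown above; elided fields are left out.
SAMPLE = """
{"type": "FeatureCollection", "features": [
 {"type": "Feature",
  "geometry": {"type": "Point",
               "coordinates": [-73.991821, 40.768696]},
  "properties": {"title": "Hells Kitchen Pug Meetup",
                 "item_date": "2011-09-18"}}
]}
"""


def summarize(geojson_text):
    """Return (title, item_date, lon, lat) for each point feature."""
    rows = []
    for feature in json.loads(geojson_text)["features"]:
        # GeoJSON coordinate order is longitude, then latitude.
        lon, lat = feature["geometry"]["coordinates"]
        props = feature["properties"]
        rows.append((props["title"], props["item_date"], lon, lat))
    return rows


for row in summarize(SAMPLE):
    print(row)
```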

More API Features

Under the Hood

The good parts are things that are good about Django: simple, straightforward design.

A quick look at two of the scarier corners...

Data Model

NewsItems have "semi-extensible" metadata. E.g. "restaurant inspections" could have a different set of metadata fields than "police reports."

Designed for simple and fast data retrieval

Tedious to configure (we hid it behind an admin UI facade)

Complicated implementation

don't say I didn't warn you

Data Model: NewsItem

class NewsItem(models.Model):
    schema = models.ForeignKey(Schema)
    title = models.CharField(max_length=255)
    description = models.TextField()
    # Treat it like a dict.
    attributes = AttributesDescriptor()
    ...

Data Model: Schema

class Schema(models.Model):
    """Describes a type of NewsItem.  A NewsItem
    has exactly one Schema, which describes its
    Attributes, via associated SchemaFields."""
    ...
    slug = models.SlugField(max_length=32,
           unique=True)
    ...

Data Model: Attribute

class Attribute(models.Model):
    """Extended metadata for NewsItems."""
    news_item = models.OneToOneField(NewsItem,
        primary_key=True, unique=True)
    schema = models.ForeignKey(Schema)
    varchar01 = models.CharField(
        max_length=4096, blank=True, null=True)
    varchar02 = models.CharField(
        max_length=4096, blank=True, null=True)
    ...
    date01 = models.DateField(blank=True, null=True)
    date02 = models.DateField(blank=True, null=True)
    ...

Data Model: SchemaField

class SchemaField(models.Model):
    """Describes the meaning of one Attribute field
    for one Schema type."""
    schema = models.ForeignKey(Schema)
    name = models.SlugField(max_length=32)
    real_name = models.CharField(
        max_length=10,
        help_text= "Column name for Attributes."
                  " 'varchar01', 'varchar02', etc.")
    ...

Data Model: all together

item = NewsItem.objects.get(schema__slug='foo',
                            ...)
item.attributes['bar'] = 'ouch'
# Equivalent to...
schemafield = SchemaField.objects.get(
    schema__slug='foo', name='bar')
attrs = Attribute.objects.get(news_item=item)
setattr(attrs, schemafield.real_name, 'ouch')
attrs.save()
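
The same idea without Django, as a toy sketch: a dict-like wrapper looks up the friendly name in a SchemaField-style table and reads or writes the generic column. This is not OpenBlock's AttributesDescriptor; every name below (SCHEMA_FIELDS, AttributeRow, AttributesDict) is invented for illustration.

```python
# Toy stand-in for SchemaField rows:
# (schema slug, field name) -> real column name.
SCHEMA_FIELDS = {
    ("foo", "bar"): "varchar01",
    ("foo", "when"): "date01",
}


class AttributeRow(object):
    """Stand-in for one Attribute row with generic columns."""

    def __init__(self):
        self.varchar01 = None
        self.date01 = None


class AttributesDict(object):
    """Dict-like view mapping friendly names to generic columns."""

    def __init__(self, schema_slug, row):
        self.schema_slug = schema_slug
        self.row = row

    def _column(self, name):
        # Raises KeyError for names the schema does not define.
        return SCHEMA_FIELDS[(self.schema_slug, name)]

    def __getitem__(self, name):
        return getattr(self.row, self._column(name))

    def __setitem__(self, name, value):
        setattr(self.row, self._column(name), value)


row = AttributeRow()
attributes = AttributesDict("foo", row)
attributes["bar"] = "ouch"
print(row.varchar01)  # the value landed in the generic column
```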

There is scarier stuff: "Lookups"

Oh my. Why not nosql?

We need from PostGIS: multipolygons, intersection & containment queries, a few triggers.

Mongo has only a few geometry types, not multipolygon

Geocouch has all geometry types, but only bbox queries.

Any other options?

Data model: Why not...

"EAV" pattern? (aka vertical tables)

OpenBlock's approach gives faster retrieval, with types. (But stores fewer values.)

Two data stores, postgis plus nosql?

Mandate: fewer install / admin headaches, not more!

Solving this is not in the scope of our contract

too bad, it could have been fun

It would have taken a pretty big rewrite, and while the design may be odd, it works fine.

Data Model: Hide it in the admin UI

http://localhost:8000/admin/db/schema/2/

Address extraction

Try this:

curl -i --data-urlencode q@- http://demo.openblockproject.org/api/geotag/ <<EOF
I was on my way to 325 Massachusetts Ave and I met some guy named
Bob Jones who said "isn't that near Back Bay?" and I said I think it's
on Mass near Shawmut. And then I sang an Olivia Newton John song.
EOF

Extracted addresses

{
"locations": [
  {
    "city": "Boston",
    "zip": "02115",
    "latlng": [
      42.342514105263156,
      -71.084578526315781
    ],
    "state": "MA",
    "address": "325 Massachusetts Ave.",
    "query": "325 Massachusetts Ave",
    "type": "address"
  },

Extracted addresses

{
  "query": "Back Bay",
  "latlng": [
    42.350222043649772,
    -71.080826021602178
  ],
  "type": "neighborhood",
  "name": "Back Bay",
  "city": "BOSTON"
},

Extracted addresses

{
  "city": "BOSTON",
  "zip": "02118",
  "latlng": [
    42.337331999999996,
    -71.078071999999992
  ],
  "state": "MA",
  "address": "Massachusetts Ave. & Shawmut Ave.",
  "query": "Mass near Shawmut",
  "type": "address"
}
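
The curl request from the "Address extraction" slide, sketched in Python 3. It only builds the POST; sending it assumes the demo server is reachable. build_geotag_request is a made-up name. The form field q and the URL come from the curl command.

```python
from urllib.parse import urlencode
from urllib.request import Request

GEOTAG_URL = "http://demo.openblockproject.org/api/geotag/"


def build_geotag_request(text, url=GEOTAG_URL):
    """Build (but do not send) a POST with the text in form field 'q'."""
    body = urlencode({"q": text}).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(url, data=body)


req = build_geotag_request("I think it's on Mass near Shawmut.")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req).read(), then json.loads() the body.
```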

Address Extraction: How?

A 125-line regex!

... into which a 130-line regex is interpolated

... 11 times
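
What "interpolated" means here, in miniature. This is a toy pattern written for this example, not OpenBlock's regex: one street sub-pattern is substituted three times into a larger verbose pattern that matches either a numbered address or an intersection.

```python
import re

# Toy sub-pattern for a street name such as "Massachusetts Ave".
STREET = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s(?:Ave|St|Rd|Blvd)\.?"

# The sub-pattern is interpolated three times into the larger pattern.
ADDRESS = re.compile(
    r"""
    (?P<number>\d+)\s(?P<street>%(street)s)     # numbered address
    |
    (?P<cross_a>%(street)s)\s(?:and|&)\s(?P<cross_b>%(street)s)  # intersection
    """ % {"street": STREET},
    re.VERBOSE,
)

m = ADDRESS.search("I was on my way to 325 Massachusetts Ave and I met Bob")
print(m.group("number"), m.group("street"))

m2 = ADDRESS.search("Massachusetts Ave & Shawmut Ave")
print(m2.group("cross_a"), "/", m2.group("cross_b"))
```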

A few lessons learned

Django Admin is still great

Scraping hypothesis: an API is easier to deal with than a framework

South migrations? -or- Multi-db? South!

Migrations must be frozen forever. So must any data they use, e.g. fixtures.

Data migration example

Use a bit of boilerplate instead of fixtures:

def forwards(self, orm):
    def _create_or_update(model_id, key, attributes):
        Model = orm[model_id]
        params = {'defaults': attributes}
        params.update(key)
        ob, created = Model.objects.get_or_create(**params)
        for k, v in attributes.items():
            # get_or_create() applies 'defaults' only when creating,
            # so set them explicitly to cover the update case.
            setattr(ob, k, v)
        ob.save()
        return ob

    _create_or_update('db.schema', {'slug': 'police-reports'},
                      {'map_icon_url': '/map_icons/police.png'})

What's planned?

Thanks!

Q?