Authorship and extensibility

Feed items can now have authors associated with them in the database using a one-to-many authors-to-items relationship. By default, however, Renquist will only use it as a one-to-one relationship (which means that there will be one author entry for each item entry). At first I struggled with the idea that there would be significant data duplication; I'm the only one writing on this blog, so why store ten identical "Kurt" authors for ten feed items? Wouldn't it make more sense to store that information only once?

Yes, but then I considered the case of comment feeds. I know already that there have been three Davids who have posted on my site. One uses "Dave", another "David", and the third showed up once and called himself "David (a different one)". There could easily be a collision of names, and it would be foolish to think that just because someone says that he's "David" that he's the same "David" who posted three other comments. Make sense?

Therefore I'm choosing to leave it up to someone else (through an as-yet-unrealized plugin framework) to decide how best to minimize duplication. Maybe the plugin could merely minimize duplication by name; easy, but perhaps not ideal in all circumstances. Maybe the plugin could minimize duplication using more advanced means (those three Davids might be writing in three separate languages, for instance). There are at least two other methods that jump to mind, but it's easy to see the potential for a smart plugin.
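To make the name-based option concrete, here is a minimal sketch of what such a plugin might do. Renquist has no plugin framework yet, so the function name `merge_authors_by_name` and the `(author_id, name)` row shape are purely hypothetical.

```python
from collections import defaultdict


def merge_authors_by_name(authors):
    """Group author rows by exact name (hypothetical plugin hook).

    `authors` is an iterable of (author_id, name) pairs. Returns a dict
    mapping every author_id to the canonical id chosen for its name
    (the smallest id seen for that name).
    """
    ids_by_name = defaultdict(list)
    for author_id, name in authors:
        ids_by_name[name].append(author_id)

    canonical = {}
    for ids in ids_by_name.values():
        keep = min(ids)
        for author_id in ids:
            canonical[author_id] = keep
    return canonical


# Illustrative rows only; not real Renquist data.
rows = [(1, "Kurt"), (2, "Kurt"), (3, "Dave"), (4, "David"), (5, "David")]
mapping = merge_authors_by_name(rows)
```

Here the two "Kurt" rows collapse into one, which is what I want, but so do the two "David" rows, who may well be different people. That is the "easy, but perhaps not ideal" trade-off in a nutshell.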

In the meantime, however, Renquist will store author information despite likely data duplication.
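For illustration, the current behavior looks roughly like the sketch below. It assumes SQLite via Python's `sqlite3` module; the table names, column names, and the `store_item` helper are made up for the example and are not Renquist's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE items (
        id        INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
""")


def store_item(conn, title, author_name):
    """Insert a fresh author row for every item: no duplicate detection."""
    cur = conn.execute("INSERT INTO authors (name) VALUES (?)", (author_name,))
    conn.execute(
        "INSERT INTO items (title, author_id) VALUES (?, ?)",
        (title, cur.lastrowid),
    )


# Ten items from the same author produce ten author rows.
for n in range(10):
    store_item(conn, f"Post {n}", "Kurt")

author_rows = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
distinct_names = conn.execute(
    "SELECT COUNT(DISTINCT name) FROM authors"
).fetchone()[0]
```

The schema itself allows one author to own many items; it is only the insert path that creates a new author every time, so a later plugin could merge rows without a schema change.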

8 comments:

dancfuller said...

Why isn't the primary field for the authors the e-mail address? Those will be unique (unless people are up to something), resolving the whole "dave", "David", "Daev" issue.

An authors table would consist of the author's e-mail with a UNIQUE key, then the various other data describing that author (first name, last name, DOB, homepage, etc.). Each author gets an auto-incremented ID in the author table, which is the primary key, which is then foreign keyed into the items table in an "author_id" column. Well, if you're using PostgreSQL you can use foreign keys, but I'm not sure if you're in MySQL or maybe SQLite. I'd hope it's SQLite. (Also, abbreviate "foreign keyed" to "FK'd" - it's funny. Go ahead, say it out loud. It's not funny unless you say it out loud. Oh...relational database humor.)

Anyway, the duplicate data issue is taken care of, you get better referential integrity (meaning, more than the current none), and you save disk space because the item table can store the author information with the INTEGER (likely, SMALLINT) datatype instead of varchar ("Character (varying)" if you're using MySQL). Faster and more space efficient.
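A minimal sketch of the schema this comment describes, assuming SQLite through Python's `sqlite3` module. Apart from `author_id`, the table names, column names, and the `author_id_for` helper are illustrative guesses, and the addresses are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK on this connection
conn.executescript("""
    CREATE TABLE authors (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        email    TEXT NOT NULL UNIQUE,
        name     TEXT,
        homepage TEXT
    );
    CREATE TABLE items (
        id        INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
""")


def author_id_for(conn, email, name):
    """Return the id for this e-mail, creating the author row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO authors (email, name) VALUES (?, ?)",
        (email, name),
    )
    return conn.execute(
        "SELECT id FROM authors WHERE email = ?", (email,)
    ).fetchone()[0]


first = author_id_for(conn, "dave@example.com", "Dave")
second = author_id_for(conn, "dave@example.com", "David")
other = author_id_for(conn, "david@example.org", "David")
```

The first two calls return the same id even though the names are spelled differently, because the UNIQUE e-mail column is what identifies the author; the third call gets its own row.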

Anonymous said...

Almost no feed anywhere includes the email address of the author. That makes it a poor choice as a primary key. Instead, I'm using an INT as a primary key in all of my tables and maintaining foreign key relationships between them.

You're absolutely correct, though, and I'm sorry I wasn't clearer about what I'm doing.

dancfuller said...

Would the base address (or domain) of the feed be usable as the unique identifier? I'm not sure of the structure of RSS (or Atom), but I'd think you could use something like that and associate an author's name with the address instead of the other way around. That'd keep the different Daves separate.

Anonymous said...

In the case of feeds, no. As a simple example, I read several of the blogs that appear on Planet Gnome. One of the bloggers aggregated there, Aaron Bockover, is also aggregated to Monologue (essentially "Planet Mono"). There are at least three distinct domains associated with the same content, and each could have different author information available.

As another example, many people are using FeedBurner for feed aggregation. Their feeds will generally be hosted on the same domain as thousands of other feeds, which is another reason why tying author information to a particular domain can cause data loss (or, worse, allow for malicious manipulation).

dancfuller said...

Alas, I had thought about FeedBurner and how it'd screw up my unique-identifier-by-domain mojo after I had posted, but I didn't want to double-post. I guess I'm wondering if RSS and Atom include an author node or attribute? (Not familiar with either - I could look it up, but I'm sure you know off-hand.)

Separately, what about using Author Name and Feed Title together as the unique identifier? Are those always present in the RSS/Atom? I don't think there'd be overlap there if they're both present.

Anonymous said...

RSS, Atom, and RDF all /could/ have author information, but it's not required. Mark Pilgrim's feedparser library exposes available author information across these formats in a unified manner.

I'm not sure if you're talking about feed titles or feed item titles, but it's not guaranteed that either will exist. I've seen blog entries sans titles, and I've seen anonymous feeds, so I can't count on that information. I don't think it'll be necessary to create unique identifiers based on combinations of information, however.

Remember, Renquist is merely storing whatever it gets in a one-to-one relationship, Author-to-Item. It's actually a one-to-many relationship in the database (one Author, many Items), but I'm creating duplicate authors because I'm punting on duplication detection for the time being. I expect there will be a default plugin that does basic duplicate detection.

Let me keep working towards a release, and definitely check out the code then! Thanks for your input, Dan!

dancfuller said...

Kurt, Kurt, Kurt. You post an entry about a design challenge, describe your workaround, then I begin an edifying (for me at least) discussion about relational database design, which data fields are included in RSS/Atom feeds, and all I get is the passive-aggressive "Let me keep working towards a release, and definitely check out the code then! Thanks for your input, Dan!" Heck, you even included the condescending exclamation marks! (See, I can play that game, too.) For someone with a passion for (or at least an interest in) design patterns, I'm just trying to keep the discussion going. I'd hate to think that you've capitulated to the problem and resigned yourself to a non-normalized database schema. Next thing I'll hear is that you're a pink-blooded Communist.

That said, being that you're saying that even "Feed Title" is not a mandatory field, I think you (or anyone else trying to store authorship details) are up sh*t's creek (but you've constructed the best paddle you could), and probably don't need me more-or-less (I'd like to think 'less') harassing you about it.

Anonymous said...

Well shoot, you caught my little exclamation mark trick. I guess I'll have to work on better cloaking my deflections!

I'm glad to find out that it was you (Dan who I know) and not someone else (Dan who I know not), though; I hope our conversation this evening clarified some things, and let's keep an open line of communication!

!!!