On Thu, 24 Mar 2005, Sonny Parafina wrote:
> You might want to look at Schuyler Erle's geocoder.us
>
> http://geocoder.us/
>
> Source is available and it's written in Perl. It's pretty nifty but it
> doesn't handle misspellings; you would probably need a soundex to handle
> that.
Thanks! I had forgotten about that. But then I looked at the source code, and realized that it's a bit beyond me. I think what I want to know is right here in US.pm:
    our %Addr_Match = (
        type    => join("|", keys %Geo::Coder::US::Codes::_Street_Type_List),
        number  => qr/\d+-?\d*/,
        state   => join("|", %Geo::Coder::US::Codes::State_Code),
        direct  => join("|", %Geo::Coder::US::Codes::Directional),
        dircode => join("|", keys %Geo::Coder::US::Codes::Direction_Code),
        zip     => qr/\d{5}(?:-\d{4})?/,
        corner  => qr/(?:and|at|&|\@)/i,
        unit    => qr/(?:pmb|ste|suite|dept|apt|room)\W+\w+/i,
    );

    {
        use re 'eval';
        $Addr_Match{street} = qr/
            (?:($Addr_Match{direct})\W+           (?{ $_{prefix} = $^N }))?
            (?:
                ([^,]+)                           (?{ $_{street} = $^N })
                (?:[^\w,]+($Addr_Match{type})     (?{ $_{type}   = $^N }))
                (?:[^\w,]+($Addr_Match{direct})   (?{ $_{suffix} = $^N }))?
    etc....
But I don't understand this. I've also received another response suggesting that regular expressions are the way to go. Unless someone can explain (in plain English) what's going on in the regex logic above, I might just have to read the O'Reilly Camel book!
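What I can piece together so far is the overall shape: an optional prefix direction, then the street name, then a street type, then an optional suffix direction. Below is a cut-down sketch of that idea so I have something to experiment with. It is not the module's code: the direction and street-type lists are short made-up stand-ins for the tables in Geo::Coder::US::Codes, parse_street is a hypothetical helper of my own, the leading house number is my addition, and plain numbered captures stand in for the (?{ ... }) blocks, which are the part I can't follow.

```perl
use strict;
use warnings;

# Hypothetical stand-in tables; the real module builds its alternations
# from the lookup hashes in Geo::Coder::US::Codes.
my $direct = join "|", qw(N S E W NE NW SE SW North South East West);
my $type   = join "|", qw(St Street Ave Avenue Rd Road Blvd Dr Ln Ct);

# The x flag allows whitespace and comments inside the pattern;
# the i flag makes matching case-insensitive.
my $street_re = qr/
    ^\s*
    (\d+-?\d*) \W+                  # 1: house number, e.g. 123 or 12-34
    (?: ($direct) \W+ )?            # 2: optional prefix direction
    ([^,]+?)                        # 3: street name, as short as possible
    [^\w,]+ ($type) \b              # 4: street type
    (?: [^\w,]+ ($direct) \b )?     # 5: optional suffix direction
    \s* \z
/ix;

# Returns a hash reference of address parts, or undef if nothing matched.
sub parse_street {
    my ($text) = @_;
    my @f = $text =~ $street_re or return;
    my %part;
    @part{qw(number prefix street type suffix)} = @f;
    return \%part;
}

for my $addr ("123 N Main St", "456 Woody Creek Rd SW") {
    my $p = parse_street($addr) or next;
    print join("|", map { defined $_ ? $_ : "" }
               @{$p}{qw(number prefix street type suffix)}), "\n";
}
# prints:
# 123|N|Main|St|
# 456||Woody Creek|Rd|SW
```

This only splits the string into parts. It does nothing about spelling variants such as "Crk" versus "Creek", and it fails on any address whose street type is not in the list.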
- Bill Thoen
>
> sonny
>
> -----Original Message-----
> From: gislist-bounces@lists.thinkburst.com
> [mailto:gislist-bounces@lists.thinkburst.com]On Behalf Of Bill Thoen
> Sent: Thursday, March 24, 2005 9:23 PM
> To: gislist@lists.thinkburst.com
> Subject: [gislist] Address Parsing for Standardization and Geocoding
>
>
> I'm looking for advice and algorithms for splitting US addresses into
> street number, prefix direction, street name, street type, suffix
> direction and unit. The problem I have is that the addresses I'm working
> with have all these logical fields combined into one physical field and
> the elements are not standardized. For example, the information in the
> street field may vary a lot. There may or may not be direction information
> or even street types. You can't be sure that the second word represents
> the prefix direction, and it's really hard to tell which word is the last
> one of the street name and whether the next word is the street type,
> suffix direction or unit. Also some street names are spelled differently,
> like "Woody Creek Rd" and "Woody Crk Rd."
>
> Any suggestions on how to approach this problem? I'm currently working on
> this in an Access database, and I can handle SQL and VBA programming
> without too much difficulty. I'm just wondering how big a problem this is,
> and how to break it down into smaller problems.
>
> I did find some general articles on street elements via Google, and I know
> where to get the USPS abbreviations for street types and directions, but I
> haven't found any technical details on how to parse and standardize
> addresses, so before I try to start from scratch, I thought I'd ask and
> see what ideas and pointers that others might have.
>
> - Bill Thoen
>
>
> _______________________________________________
> gislist mailing list
> gislist@lists.geocomm.com
> http://lists.geocomm.com/mailman/listinfo/gislist
>
> _________________________________
> This list is brought to you by
> The GeoCommunity
> http://www.geocomm.com/
>
> Get Access to the latest GIS & Geospatial Industry RFPs and bids
> http://www.geobids.com
>