05-02-2007, 07:32 AM   #1
Wmshyrga (Senior Member, joined Oct 2005, 494 posts)
Anyone good at regular expressions? (computing)
I have posted this at a specialist forum, but activity there is rather slow, and I was hoping some of you computer scientists might be able to help me with it.

I have spent a long time today trying to build a regular expression for screen scraping a website. The (example) text it will be scanning is:

" Glouster Museums: Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England
"

What I am trying to get from that is ONLY the museum name and the address, with none of the HTML. So I would like:

Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY


My regEx at the moment is -

[\w <>\\=\"]*\-[\w, ]*GL[\d ]*[\w, ]*land

Which returns -

Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England

This is close, but I need to omit the HTML tags, and ideally get rid of the trailing 'England' too. The latter is not essential, however; just getting rid of the HTML will do.

I am using this program http://www.webscrape.com/, which you run from the command line. I know it might be a bit of a long shot but does anyone have an idea of what I can try? Thanks.
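
One possible approach, sketched in Python rather than in the scraping tool's own syntax, is to do the clean-up in two steps: first delete anything that looks like a tag, then trim the trailing country. The input string is the output quoted above; the tag pattern `<[^>]*>` and the helper names `strip_tags` and `drop_country` are assumptions made for this illustration, not part of any tool mentioned here.

```python
import re

# Raw text as returned by the expression above (the tag is still present).
raw = ('Abbey Home Museum<span class=text ALIGN="justify"> - '
       'Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

# Assumption: a tag is "<", then any characters other than ">", then ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove every tag, keeping only the text between tags."""
    return TAG.sub("", text)


def drop_country(text: str, country: str = "England") -> str:
    """Remove a trailing ', <country>' if present (hypothetical helper)."""
    return re.sub(r",\s*" + re.escape(country) + r"\s*$", "", text)


cleaned = drop_country(strip_tags(raw))
print(cleaned)
# Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY
```

This simple tag pattern should be enough for a one-line snippet like this one, but it is not a general HTML parser: a `>` inside an attribute value, for instance, would confuse it.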

05-02-2007, 07:43 AM   #2
Wmshyrga
OK, final stretch: I found this, which selects everything within HTML tags:




Now how do I include that in my expression? I want it to get everything my initial expression matches EXCEPT for everything inside HTML tags.

Anyone?
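
On including a tag pattern in the main expression: a regex match is always one contiguous stretch of text, so a single match cannot skip over tags in the middle of it. One possible workaround, sketched in Python rather than in the scraping tool's own syntax, is to let the expression match the tags but wrap only the wanted pieces in capture groups, then join the groups. The tag pattern `<[^>]*>`, the simplified postcode pattern, and the group names `name` and `address` are assumptions for illustration.

```python
import re

# Sample text: the output quoted in the first post, tag included.
raw = ('Abbey Home Museum<span class=text ALIGN="justify"> - '
       'Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

PATTERN = re.compile(
    r"(?P<name>[\w ]+?)"       # museum name (captured)
    r"(?:\s*<[^>]*>)*"         # any tags: matched, but not captured
    r"\s*-\s*"                 # the dash separating name and address
    # Address up to and including a postcode. The postcode part is a
    # simplified assumption, not a full UK postcode validator.
    r"(?P<address>[\w, ]*?[A-Z]{1,2}\d[\dA-Z]? \d[A-Z]{2})"
)

match = PATTERN.search(raw)
if match:
    # Join only the captured pieces; the tag and ", England" are left out.
    result = f"{match.group('name').strip()} - {match.group('address').strip()}"
    print(result)
    # Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY
```

Because the address group stops at the postcode, the trailing ", England" falls outside the captured text without needing a separate clean-up step.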