Anyone good at regular expressions? (computing) - DiscussWorldIssues - Socio-Economic Religion and Political Uncensored Debate

**Wmshyrga** · 05-02-2007, 07:32 AM

I have posted this at a specialist forum but activity is rather slow, and was hoping maybe some of you computer scientists might be able to help me with it.

I have spent a long time today trying to build a regular expression for screen scraping a website. The (example) text it will be scanning is:

"Glouster Museums: Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England"

What I am trying to get from that is ONLY the museum name and the address, with none of the HTML. So I would like:

Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY

My regEx at the moment is -

[\w \\=\"]*\-[\w, ]*LS[\d ]*[\w, ]*land

Which returns -

Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England

This is close, but I need to omit the HTML tags, and ideally get rid of the trailing 'England' too. The latter is not so essential, however; just getting rid of the HTML will do.

I am using this program http://www.webscrape.com/, which you run from the command line. I know it might be a bit of a long shot but does anyone have an idea of what I can try? Thanks.
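For anyone reading along, here is a rough sketch of one way to get the wanted text in two steps: strip the tags first, then match the name and address. It is written in Python rather than for the webscrape tool, and it rests on assumptions: the sample string places the `<span>` tag where the output quoted above shows it, `<[^>]*>` is a common tag-matching pattern (not necessarily suitable for every page), and the postcode pattern is a simplified guess at the UK format used to stop the match before 'England'.

```python
import re

# Sample input. The tag position is inferred from the output quoted above,
# not taken from the real page source.
html = ('Glouster Museums: Abbey Home Museum<span class=text ALIGN="justify">'
        ' - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

# Step 1: remove anything that looks like an HTML tag.
text = re.sub(r'<[^>]*>', '', html)

# Step 2: capture everything after the "Glouster Museums:" label up to and
# including the first postcode-like token (simplified UK postcode shape).
pattern = re.compile(r':\s*(.+?\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})\b')
match = pattern.search(text)
result = match.group(1) if match else None

print(result)
```

Stopping at the postcode rather than at 'land' also drops the trailing country name, which was the optional part of the request.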

**Wmshyrga** · 05-02-2007, 07:43 AM

Ok final stretch, found this which selects everything within html tags-

Now how do I include that in my expression? I want it to get everything in my initial expression EXCEPT for everything in HTML tags ().

Anyone?
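One thing worth noting: a single regex match is always one contiguous piece of text, so one match cannot skip over a tag in the middle. A possible workaround is to use capture groups on either side of the tags and join them afterwards. The sketch below shows the idea in Python. Whether the webscrape tool supports capture groups is not known here, the tag pattern `<[^>]*>` is an assumed stand-in for the one mentioned above, and the sample string places the tag where the earlier output shows it.

```python
import re

# Sample input. The tag position is inferred from the output quoted earlier
# in the thread, not taken from the real page source.
html = ('Glouster Museums: Abbey Home Museum<span class=text ALIGN="justify">'
        ' - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England')

pattern = re.compile(
    r'(\w[\w ]*?)'        # group 1: museum name (word characters and spaces)
    r'\s*(?:<[^>]*>\s*)*' # zero or more tags, matched but not captured
    r'(-[\w, ]*?)'        # group 2: dash plus address
    r',\s*England'        # trailing country, matched but not captured
)

m = pattern.search(html)
combined = m.group(1) + ' ' + m.group(2) if m else None

print(combined)
```

The tags and the trailing 'England' are still matched, they are just left outside the groups, so only the two captured pieces end up in the result.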
