Web Crawler Advice Sought

Post by **ussusimiel** » Wed Nov 20, 2013 9:53 pm

I'm looking for info about web crawlers and web spiders. What kind of information can they provide? Are there any good free ones? Is building the database behind them difficult?

I've done some study on the Web, but I'm still a bit at sea, so any advice or direction will be appreciated.

u.

Post by **wayfriend** » Wed Nov 20, 2013 10:24 pm

I've never worked on a Web Crawler specifically.

In addition to what is on web pages, a crawler could develop a map showing which page links to which. From there, you can build models that show what kinds of things are on pages that link to a page. That's sort of what Google does: it doesn't care if a page says "Fruit of the Land", it cares whether other pages say that the page is about "Fruit of the Land". It doesn't analyze an image to discover it's a piece of fruit, it looks at pages that reference it and notices that they say "here's a piece of fruit".
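To make the "map of what links to what" idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not how any real search engine works: the `PAGES` dictionary, the `LinkExtractor` class and the `build_link_map` function are all made-up names, and the HTML is hard-coded instead of being fetched from the web.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_link_map(pages):
    """Map each target URL to a list of (source URL, anchor text)."""
    inbound = defaultdict(list)
    for source_url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        for href, text in parser.links:
            # Resolve relative links against the page they appear on.
            inbound[urljoin(source_url, href)].append((source_url, text))
    return dict(inbound)


# Hypothetical pages standing in for fetched HTML (assumption, not real data).
PAGES = {
    "http://example.com/a": '<p>See <a href="/fruit">Fruit of the Land</a>.</p>',
    "http://example.com/b": '<a href="http://example.com/fruit">a piece of fruit</a>',
    "http://example.com/fruit": "<h1>Untitled</h1>",
}

link_map = build_link_map(PAGES)
for source, text in link_map["http://example.com/fruit"]:
    print(f"{source} says: {text!r}")
```

The target page itself says nothing useful ("Untitled"), yet the link map records what the two pages pointing at it say it is about, which is the point made above.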

Unless you want to crawl a very small number of pages, I would say that the database would need to be astronomical in size and highly indexed. It's certainly a "big data" class of problem. So: very difficult.

Probably you already know all this. I could maybe help with more specific questions.

Post by **Fist and Faith** » Wed Nov 20, 2013 10:29 pm

Interesting that I'd never heard of this stuff until a couple weeks ago, when I read Robert Sawyer's WWW trilogy. My understanding (the correctness of which I make no claims) is that they are what gather the information for search engines. They are sorta the things that build a database? And absolutely no idea how they're made.

Post by **ussusimiel** » Wed Nov 20, 2013 11:49 pm

I wouldn't be looking to crawl a huge number of pages, nor use a huge amount of computing power. It would be more of a steady build-up of information over time. I downloaded one free but wasn't able to get the database behind it hooked up. (I'm not including the links as I don't want anyone to go to any trouble. I asked just in case anyone had a nifty solution that might save me some of the inevitable hard slog.)

u.

Post by **Avatar** » Thu Nov 21, 2013 5:18 am

They can collect anything publicly available for the most part. Anything secure is inaccessible unless you have rights.

Depending on the tool, it might be affected by a site's robots.txt file (specifying parts of the site that should not be indexed). Or it might not.
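For a rough idea of what respecting robots.txt looks like in practice, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent and the URLs are invented for illustration, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not taken from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
allowed = checker.can_fetch("example-bot", "http://example.com/catalogue.html")
blocked = checker.can_fetch("example-bot", "http://example.com/private/list.html")
print(allowed, blocked)
```

Nothing forces a tool to make this check, which is why some crawlers honour robots.txt and some don't.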

Afraid I don't know anything about using them to build a db though, except for the fact that technically, anything they crawl and copy becomes a db.

--A

Post by **wayfriend** » Thu Nov 21, 2013 2:41 pm

u, if you tried an off-the-shelf crawler, it must come with instructions on how to build the database it requires. If you can't make it work, then there's a problem with their instructions. I am not sure if it requires something very specific (for example, a PostgreSQL database), or something described more abstractly (a JDBC resource), or if it expects you to write the adapter code on top of the database and link it into the code. Without more information, that's about all I can say. If you are encountering specific database errors, those would be a clue, too.

Post by **ussusimiel** » Sat Nov 23, 2013 12:34 am

Just a bit of an update on my crawling experience: I got a bit further with the free crawler that I'd downloaded. I needed to install MySQL so that I'd have the right server for the application. I managed to get some connectivity between the two, but then as I was digging around the web I realised that it wasn't a web crawler/spider that I was looking for but rather a web scraper (new name for me!).

That's given me a new direction to go in and I'm currently looking at some free(ish) online ones. Thanks for the input. I'll get back to you if I get stuck or make progress.

u.

Post by **Avatar** » Mon Nov 25, 2013 5:17 am

And just what are you scraping...

...Never mind...I don't want to know.

--A

Post by **ussusimiel** » Mon Nov 25, 2013 8:01 pm

Avatar wrote: And just what are you scraping...

...Never mind...I don't want to know.

I don't think that you could even begin to imagine how banal the information I'm looking for is. I've started working for my brother who owns a construction machinery sales company. Part of my job is to maintain and build his mailing list. Basically I've become a part-time spam artist.

If you get an odd email with a catalogue of used machinery parts, just unsubscribe and forgive me. If you get another one (I've finally found out why unsubscribing sometimes doesn't seem to work: they just put you back on the list), you have permission to flame me.

u.

Post by **Avatar** » Tue Nov 26, 2013 6:58 am

Actually, that is by far the most common use (and original purpose) of "scrapers." To build a mailing list. Watch out, the opt-in / opt-out requirements are getting stricter all the time.

--A

Kevin's Watch
