July 23, 2003
victory is mine!
i spent some time making my xanga backup script not suck. originally i had written it as a recursive function that worked like this:
Step 1.) Open page
Step 2.) Clean up page (e.g. strip out the header/footer info so my regexps could go faster later)
Step 3.) Isolate important data (entries, time, dates)
Step 4.) Upload data
Step 5.) Locate the "Next 5" link and call this function again with that page
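for the curious, here's a rough python sketch of that recursive shape. it is not the actual script: `FAKE_SITE`, `backup`, and the html markup are all made up for illustration, and a dict stands in for fetching real pages.

```python
import re

# made-up stand-in for the site: page url -> html (the real thing would fetch over http)
FAKE_SITE = {
    "/page1": '<div class="entry">first post</div><a href="/page2">Next 5</a>',
    "/page2": '<div class="entry">second post</div><a href="/page1">Previous 5</a>',
}

def backup(url, uploaded):
    html = FAKE_SITE[url]                                            # step 1: open page
    body = html.strip()                                              # step 2: clean up (placeholder)
    entries = re.findall(r'<div class="entry">(.*?)</div>', body)    # step 3: isolate data
    uploaded.extend(entries)                                         # step 4: "upload" (just collect here)
    match = re.search(r'<a href="([^"]+)">Next 5</a>', body)         # step 5: find the "Next 5" link
    if match:
        backup(match.group(1), uploaded)                             # call ourselves with the next page
    # html, body and entries all stay alive in this frame until the call above returns

result = []
backup("/page1", result)
```

every page adds one more nested call, so a long archive means a deep stack of calls all waiting on the last one.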
The first problem I ran into was that finding the "Next 5" link was a bit harder to pull off than originally thought. on the last page to back up, the script decided the "previous 5" link was actually the "next 5" link, so it started an endless loop.
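one way to avoid that mix-up is to look at the text of each link instead of grabbing whatever link ends in "5". a minimal sketch, assuming the links are plain `<a href="...">` tags with "Next 5" or "Previous 5" as their text (i'm guessing at the markup here; `find_next_link` is a hypothetical helper):

```python
import re

# capture the href and the visible text of every anchor on the page
ANCHOR_RE = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                       re.IGNORECASE | re.DOTALL)

def find_next_link(html):
    """return the href of the link whose text starts with 'next 5', or None."""
    for href, text in ANCHOR_RE.findall(html):
        # drop any nested tags, collapse whitespace, ignore case
        label = " ".join(re.sub(r"<[^>]+>", "", text).split()).lower()
        if label.startswith("next 5"):
            return href
    return None  # last page: nothing to follow, so the caller can stop

middle_page = '<a href="/p1">Previous 5</a> | <a href="/p3">Next 5</a>'
last_page = '<a href="/p2">Previous 5</a>'
```

on `last_page` this returns `None` rather than the "Previous 5" href, which is what keeps the backup from bouncing back and forth forever.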
in any case, i fixed this, but then memory consumption became appalling. it required upwards of 60 MB of RAM per backup ... ridiculous. at first, i couldn't figure it out. i wasn't carelessly leaving any large arrays or variables set.
then neeraj enlightened me. the recursive call was the last thing each function did (a tail call), but it wasn't being optimized away, so every call was still holding on to all its data ... none of it got dumped into garbage until its function finished, which technically didn't happen until the last one had run. ah-ha!
so i learned something new today. i moved the recursive function into a loop (it's invoked through a loop now and doesn't call back into itself), and now the script runs blazingly fast; it grabbed a year's worth of xanga for me in under 20 seconds.
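here's roughly what the loop version looks like, sketched in python. again this is an illustration and not the real script: `PAGES`, `fetch_and_parse`, and `backup_all` are made-up names, and the dict fakes the fetch + regexp steps.

```python
# made-up stand-in: url -> (entries on that page, url of the next page or None)
PAGES = {
    "/p1": (["entry a", "entry b"], "/p2"),
    "/p2": (["entry c"], "/p3"),
    "/p3": (["entry d"], None),
}

def fetch_and_parse(url):
    """pretend to open the page, clean it up, and pull out entries + next link."""
    return PAGES[url]

def backup_all(start_url, upload):
    url = start_url
    pages_done = 0
    while url is not None:
        # each pass rebinds `entries` and `url`, so the previous page's data
        # is no longer referenced from here and can be garbage collected
        entries, url = fetch_and_parse(url)
        upload(entries)
        pages_done += 1
    return pages_done

saved = []
count = backup_all("/p1", saved.extend)
```

only one page's worth of data is held by the loop at a time, instead of one page per nested call.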
i've decided that instead of storing all that in memory, i'm going to dump it into the database and then work with it from there. it'll leave more options open in the future in case i want to add more functionality to it (i don't know why).
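a minimal sketch of the dump-it-in-the-database idea, using python's built-in `sqlite3` purely as a stand-in (the table layout, `open_store`, and `save_entries` are assumptions for illustration, not the real schema or database):

```python
import sqlite3

def open_store(path=":memory:"):
    """open the database and make sure the (hypothetical) entries table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, posted_at TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn

def save_entries(conn, entries):
    """write one page's worth of (posted_at, body) pairs in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO entries (posted_at, body) VALUES (?, ?)", entries
        )

conn = open_store()
save_entries(conn, [("2003-07-23 12:18", "victory is mine!"),
                    ("2003-07-22 21:05", "older post")])
rows = conn.execute(
    "SELECT posted_at, body FROM entries ORDER BY posted_at"
).fetchall()
```

once each page is written out like this, the script doesn't need to keep the entries around, and anything extra later (sorting, filtering, exporting) is just another query.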
this thing should be done by tomorrow. i've started documenting tabulas (tabulas tutorials) which should serve as a guided tour (of sorts) on how to use tabulas for you new users (since it is a bit complex).
Posted by roy on July 23, 2003 at 12:18 PM | 1 Comments