Article: AT Hiking Rates, Section by Section

**Alligator** · 02-13-2006, 14:51

Originally Posted by dje97001

Yeah thanks pal. I know that (lepto, platy, meso...). I was giving you options... consider it a grammatical mistake.

You're welcome.

Originally Posted by dje97001

...
The point I'm trying to make--one that Chris made to me a while ago (but it took a while to accept)--is that there are already too many numbers for most people to spend much time on. Hikers understand that their mileage may vary. Confidence intervals, while very useful in gaining more precision (most of the time extremely desirable), in this case will simply obscure the value of these numbers to most people (i.e. they don't want to know that there is a 95% chance of them making it from Springer to the Georgia border in 7.25 to 8.45 days); they just want the best guess, for which the mean (or median) should suffice.

Most estimates that people give are in the days range. As an example, 5-7 days to finish the Smokies. I'd want to know what this range is if I were a slow hiker, so I could plan on eating the last day. Of course YMMV; that's why a confidence interval or even a range is much better than a point estimate. I think most folks can understand 6 days give or take a day. Those interested could look to the right of the estimate; everyone else could ignore it. In a similar vein, the estimate of 167.8 days would be vastly improved by saying it took them on average 168 days +/- 21 days. (I made the interval up.) This certainly would give an impression that there is a lot of variability. In particular, being risk averse, I would want the upper bound. Then a hiker could have a conservative, reliable estimate as to how much time, money, and supplies are necessary for the journey.

[Note to Map man: Actually, what I think would be better is to take, say, the 10th and 90th percentiles for the time it takes to do a section, along with the median. A confidence interval for the mean still relates to the mean. But the 10th and 90th percentiles would give you a good idea of the range yet would exclude extreme outliers. It would also be distribution free.]
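As a rough illustration of the percentile idea in the note above, here is a minimal Python sketch. The `section_days` values are made up for illustration and are not from the study, and `statistics.quantiles` is used with its default interpolation method, which is only one of several reasonable conventions for percentiles.

```python
import statistics

# Hypothetical days-to-complete for one section (NOT data from the study).
section_days = [5.0, 5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0,
                8.0, 8.5, 9.0, 9.5, 10.0, 11.0, 14.0]

def percentile_summary(days):
    """Return (10th percentile, median, 90th percentile) for a list of day counts."""
    # quantiles(n=10) returns the 9 cut points between deciles:
    # index 0 is the 10th percentile, index 8 is the 90th.
    deciles = statistics.quantiles(days, n=10)
    return deciles[0], statistics.median(days), deciles[8]

p10, med, p90 = percentile_summary(section_days)
print(f"10th: {p10:.1f}  median: {med:.1f}  90th: {p90:.1f}")
```

No distributional assumption is needed here, and the single slow value (14.0) stretches the 90th percentile far less than it would stretch a mean-based interval.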

Originally Posted by dje97001

Yes, the pace for April starters may be different from March or Feb starters for only the first 2 sections...but map man didn't ask for a critique. He didn't even ask that it be placed in the articles section. Such a detailed criticism without prompting will only make it less likely for people to share potentially valuable information. I for one think that in its present form it is definitely of value to the hiking community.

That set of numbers map man listed is complicated enough.

As far as I know, MM posted this in the Articles section. The forum where this thread is currently located is not the finished articles section. Once an article gets feedback, it gets elevated to the completed articles section. I don't see your view of MM's intent as being correct. Further, within limits, any topics placed on the site are open to discussion; it is a public forum. Now, if you could, please give MM a chance to speak if he so chooses. Thanks pal. I'm so happy I have a new buddy!

**map man** · 02-14-2006, 02:35

I assumed there would be people here at WhiteBlaze with more knowledge of statistical methods than I have and I'm feeling pretty good that some of you are giving me some thoughtful advice on my proposed article. After thinking about what dje97001, Alligator and Tha Wookie in particular have had to say I've spent some time this evening calculating some medians to incorporate in the study. I've already edited my article to add the median for total days hiked and total zero days taken and I've calculated the median for days taken to hike each section, though I'm still debating how best to include those figures in the article. And it's getting too late in the evening for me to think clearly about it at the moment.

I also incorporated a suggestion of ALHikerGal to mention the gender breakdown of the 143 hikers in the study and that, too, I've already edited into the article. The ages of the hikers are impossible to know because, just like here at WhiteBlaze, not everyone at Trailjournals chooses to reveal their age. I'm not going to break down the hiking rates by gender in the article because the number of female hikers in the study at this point is not high enough to make the numbers meaningful, I think.

I've got to be candid with Alligator and Tha Wookie -- I don't know how to calculate confidence intervals and though I know what an "outlier" is, I don't know the statistical methods for figuring out just how outlandish (wink, wink) an oddball bit of data needs to be to throw it out. And no, this is not an invitation for anyone to give me a crash course. I'm thinking over Alligator's idea for giving the 10th and 90th percentile values for hiking days per section as a way of dealing with extreme numbers at either end, because that is something I do know how to do. But I'm still thinking about it. I'll post more on this in the next day or two.

Finally, Topcat's interest in seeing my raw data is something I've also been thinking about. Right now the data is written out by hand (in very small print) on several tally sheets, but I've known all along that it would probably be a good idea to convert this stuff to an electronic spreadsheet of some kind, and this just provides extra motivation (but as of March 2008, I still haven't done it).

And by the way, I should mention that my original posting of the article was indeed in the "Articles Forum" because I intended from the beginning for it to be an article if it passed muster. But since three or four of the first posters said something like, "hey, this should go in the articles section," I can understand why it wasn't clear to some that that was my intent. Anyway, every post that I've seen that has made suggestions for the article has in my view been in the spirit of wanting to see the article be as good as it can be, and for that I'm thankful. Actually, the thing I feared most was that after my months of work the article might be greeted with utter indifference, and it's clear from all the responses and views the thread has gotten in a little over 24 hours that this isn't the case.

**dje97001** · 02-15-2006, 06:15

It appears my perception was in error. I still stand by my statement that too many numbers will only confuse the issue--but I'm willing to accept that I am in the minority on this. So, since map man doesn't mind, feel free to edit away. Enjoy!

**Heater** · 02-15-2006, 06:32

Wow!

Great post, Map Man.

**Bilko** · 02-15-2006, 10:25

map man. Thanks for the work. As a section hiker I often looked at different journals and tried to figure out how long it took them to hike certain sections. I can actually see how my section hikes fit into a thru-hike. Your work allows us to see how long it took the people that were able to document their achievements. I enjoyed the study greatly. I liked the way you broke it into the 11 sections, I enjoyed looking at the tables and your explanations of how and why you counted days etc., the way you did.
How long did it take you? Did you make copies of all the journals? Did some of the journals seem unlikely to have occurred? Your next assignment.... find out the common occurrences that cause people to drop out before the first section or by Fontana. My guess is improper food and dehydration. However, you may never actually find out the reason. Which may be best.

**ARambler** · 02-15-2006, 19:11

0) I seem to be the only poster who has actually used your data. I was going to complain about all of the bandwidth wasted by those who say "the users are too stupid to use so much data" or "I'm so smart I can't use your data, unless you provide the gene sequence on chromosome 18 for each hiker who starts at Springer." However, I see the updates you have made so far are really good. Keep up the good work. I'll get back to how I've used and would like to use the data.

1) You have two types of very useful quantitative data: How people hike, and how people don't hike. My zero days were sporadic, and did not correlate well with your averages. So, as I posted earlier, I looked at my Hiked Days versus your hiked days. I calculated your hiked days by multiplying your mean number of days by (100-%zero)/100. I get essentially the same numbers by taking (miles/section)/your miles per hiked day. This is my calculation:

<table>
<tbody>
<tr><td>Your days/section</td><td>% zero days/section</td><td>Hiked days/section</td></tr>
<tr><td>7.95</td><td>5.50</td><td>7.51</td></tr>
<tr><td>7.71</td><td>4.60</td><td>7.36</td></tr>
<tr><td>24.34</td><td>13.00</td><td>21.18</td></tr>
<tr><td>28.60</td><td>15.30</td><td>24.22</td></tr>
<tr><td>11.32</td><td>15.10</td><td>9.61</td></tr>
<tr><td>19.32</td><td>17.00</td><td>16.04</td></tr>
<tr><td>12.32</td><td>13.80</td><td>10.62</td></tr>
<tr><td>23.11</td><td>10.00</td><td>20.80</td></tr>
<tr><td>9.60</td><td>10.20</td><td>8.62</td></tr>
<tr><td>9.79</td><td>10.00</td><td>8.81</td></tr>
<tr><td>13.77</td><td>6.60</td><td>12.86</td></tr>
<tr><td>total 167.83</td><td>12.00</td><td>147.69</td></tr>
</tbody>
</table>
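For anyone who wants to redo this calculation without a spreadsheet, a minimal Python sketch follows. The numbers are the mean days and percent zero days from the table above; the function and variable names are just illustrative choices.

```python
# Mean days and percent zero days per section, as listed in the table above.
mean_days = [7.95, 7.71, 24.34, 28.60, 11.32, 19.32,
             12.32, 23.11, 9.60, 9.79, 13.77]
pct_zero = [5.50, 4.60, 13.00, 15.30, 15.10, 17.00,
            13.80, 10.00, 10.20, 10.00, 6.60]

def hiked_days(days, pct_zero_days):
    """Days actually spent hiking: total days scaled by the non-zero-day share."""
    return days * (100 - pct_zero_days) / 100

per_section = [hiked_days(d, z) for d, z in zip(mean_days, pct_zero)]
for d, z, h in zip(mean_days, pct_zero, per_section):
    print(f"{d:6.2f} {z:6.2f} {h:6.2f}")

# Whole-trail row: the summed section means at 12% zero days overall.
total_hiked = hiked_days(sum(mean_days), 12.00)
print(f"total {sum(mean_days):.2f} -> {total_hiked:.2f}")
```

The last line uses the overall 12% figure from the table rather than summing the per-section results, which is how the total row above was computed.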

2) I like to use the hiking days data separately. Also, especially from a statistical point of view, you should not combine hiking and not-hiking without testing for independence. What does a simple plot of the number of zero days versus the number of days hiked look like? I would guess that the slope below the mean would be more than proportional, i.e. reducing the days hiked in half from 148 to 74 would reduce the zero days by more than from 20 to 10. (This would presumably be an extrapolation of the line.) However, at the upper end, the slope might be less than proportional. A hypothetical hiker doing 148+74 = 222 hiked days might have to reduce the number of zero days to complete the hike before winter. Therefore, Table 4 could be off a little.
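One simple way to look at the relationship asked about above is to compute the correlation and the fitted slope of zero days against hiked days. The sketch below is only an illustration: the paired values are invented, not taken from the study, and a correlation near zero would not by itself prove independence.

```python
def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y on x) for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    return r, slope

# Hypothetical (hiked days, zero days) pairs, NOT data from the study.
hiked = [100, 120, 140, 150, 160, 180]
zeros = [5, 10, 18, 20, 24, 22]

r, slope = pearson_and_slope(hiked, zeros)
print(f"r = {r:.2f}, slope = {slope:.3f} zero days per hiked day")
```

Fitting the slope separately for hikers below and above the mean would be one way to check the guess that the relationship is not proportional at both ends.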

3) Similarly, I'm a little reluctant to build a Table 4b, for a "typical Hiking Days" number for 4, 5, 6, and 7 months. I would want to know whether the sub-130 day people started in good physical shape (and aggressive mentality) and sped up by the same or less percentage as others, or the sub-130 day people started the same or slightly faster and sped up significantly more than the average. I'd have similar concerns about 180+ day hikers speeding up in Maine to beat the snow. Note, "Table 4b" methodology worked very well for me.

4) Skewed distributions: Thanks for the detailed mean versus median data. It is interesting to see how the mean catches up to the median. I guess the data show that the people who start very late, have to catch up relative to the median.
It does not surprise me that the total days distribution is skewed to the left and consequently the median is higher than the mean. Similar arguments to 2) above would make me believe that it is significantly more likely that a hiker would finish in 122 days (168 - 46) than in more than 214 days. (This may just be that 4 month people brag more than 7 month people.) This feature of the data will make it difficult to use statistical tests that rely on normality for the probability of finishing in a given time. So what? Hikers should worry about being in the 20 % of those who finish, not worrying about being in the top 10 % (2 % of starters) or having enough time to be in the bottom 10 % of this 20%.

5) Outliers: I'm surprised outliers did not seem to be a concern to you. In 2005, Apple Pie left the trail in Erwin for about 50 days. This is over half of the LTB you report for the Fontana to Damascus section. Similarly, Stumpknocker took almost 365 days to hike the trail in 2004, but he hikes at a 4 month pace. (Hippy LS also did 360+ days but I don't think she had a complete TJournal. FB & Silver Girl took > 80 days off but their journal was not on Trailjournals.) I don't have much of a point to make about outliers, they are a part of life and a part of life on the AT. However, I think they are more a factor affecting zero days, and that's another reason for separating out zero days.

6) Variability:
a) By far the most common expression of variability is the standard deviation, or variance = std. dev. squared. It should be calculated using a spreadsheet or the standard deviation function on a programmable calculator. If you have to calculate it by hand:
Std.Dev^2 = [sum of each (day squared) - n*Ave^2]/(n-1) = [sum(Di*Di) - 2,957,525]/104; where Di = total days for each hiker, i, and 2,957,525 = 105 hikers*167.83*167.83 average days. For the Hiked Days it would be: Std.Dev^2=[sum(di*di) - 2,290,295]/104; Note, 105*147.69*147.69 = 2,290,295.
I hope you have the Excel function for sample standard deviation.
b) The easiest calculation for variability is range; just the longest minus shortest days. I believe one must assume a normal distribution to convert Range to (unbiased) variance (std.dev^2).
c) The other commonly used expression for variability is a confidence range. The most common range is a 95% interval, which for a normal distribution is about plus or minus 2 std. dev. from the mean. I recommend against using these confidence intervals with such skewed distributions. Note, the 95% confidence range means 2.5% faster and 2.5% slower. Since the normal assumption makes the estimates symmetric, the confidence interval is often expressed as Average +/- Interval/2, e.g. 168 +/- 30 days for a std. dev. of about 15 days.
d) You could report the actual % interval as a pseudo-confidence interval. Just figure out the number of days that 2.6 hikers were slower and another number which 2.6 hikers were faster. What has been proposed is reporting the lower and upper numbers of -10% and +90 %. For the section data, I think you will find it difficult to interpolate between whole days for the -10%/+90% number, which in my mind is an arbitrary, non-standard percentile, pseudo-confidence interval. The number might be highly dependent on how many people hiked through the section boundary on one day.
e) If you remove the variability associated with the zero days, you might be able to give a good representation for hiking variability just by reporting aggregate data. This data might also be easiest to understand and use in a statistic free way. I propose to aggregate the data for each section into five groups. Because the distances vary by such a large amount, the intervals for the groupings should also vary. I suggest that each of the five groups vary by m=1, 2, or 3 days. You would then report 8 values/section: g1.days, m, n.g1, n.g2, n.g3, n.g4, n.g5, Slow. I'm not sure whether the g1.days should be integer and the start of the interval. Assuming that it is, you would get numbers like:
5, 1, 12, 21, 32, 19, 11, 2. For the first section, 12 hikers would reach the GA line in 5.0 to 5.9 days, 21 hikers would reach the border in 6.0 to 6.9 days, 32 hikers in 7.0 to 7.9 days, 19 hikers in 8.0 to 8.9 days, 11 hikers in 9.0 to 9.9 days and 2 hikers over 9.9 days (optional). By calculation, 105-(12+21+32+19+11+2)=8 hikers less than 5.0 days. The relative distribution for the Damascus to Waynesboro section will not be exactly the same, but if it were, the data would be reported as 17, 3, 12, 21, 32, 19, 11, 2, and the groupings would be: 17 to 19.9 days, 20 to 22.9 days, 23 to 25.9 days, 26 to 28.9 days, and 29 to 31.9 days. Slow hikers would look at this raw data and see 11 in 105 needed 9.0 to 9.9 days of food to reach the GA border and 29 to 31.9 days to get to Waynesboro, and would plan on packing this amount. (Hopefully, not all at once.)
f) I will be very interested if the variability is significantly different for the first couple and last couple of sections.
Rambler
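The hand formula in a) and the grouped summary in e) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `days` values are invented rather than taken from the study, `group_summary` is a hypothetical helper name, and the group start is treated as an integer at the beginning of the interval, as item e) tentatively assumes.

```python
import statistics

def variance_shortcut(days):
    """Sample variance via [sum(D_i^2) - n * mean^2] / (n - 1)."""
    n = len(days)
    mean = sum(days) / n
    return (sum(d * d for d in days) - n * mean * mean) / (n - 1)

def group_summary(days, first_start, width, groups=5):
    """Return [first_start, width, n.g1 .. n.g5, n_slow] as in item e)."""
    counts = [0] * groups
    slow = 0
    for d in days:
        idx = int((d - first_start) // width)
        if idx >= groups:
            slow += 1          # slower than the last group
        elif idx >= 0:
            counts[idx] += 1   # falls inside one of the groups
        # idx < 0: faster than the first group; recovered by subtraction
    return [first_start, width, *counts, slow]

# Hypothetical days for one section, NOT data from the study.
days = [4.5, 5.2, 5.8, 6.1, 6.4, 6.9, 7.0, 7.3, 7.7, 8.2, 8.8, 9.5, 10.4]

var = variance_shortcut(days)
print(f"std dev (shortcut): {var ** 0.5:.3f}")
print(f"std dev (library):  {statistics.stdev(days):.3f}")
print(f"range: {max(days) - min(days):.1f}")

summary = group_summary(days, first_start=5, width=1)
faster = len(days) - sum(summary[2:])  # hikers quicker than the first group
print(summary, "faster than first group:", faster)
```

The shortcut formula subtracts two large, nearly equal numbers, so with real data a library routine such as `statistics.stdev` is the safer choice; the shortcut is shown only because it matches the by-hand formula.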

**domnokmis** · 02-15-2006, 21:07

Originally Posted by dje97001

While it can be beneficial in improving the research, it also can be perceived as really jerky (esp. to people who aren't in academia). It is always easier to critique a study than to conduct one yourself.

I thought he was being pretty jerky, myself. And I'm not in academia, so you are obviously correct. Perhaps if I were more insulated from the practical, I could be more picky.

But as far as I can tell, he applied the study to something the author did not extend it to, then said, it can't be used for this purpose. Duh.

Besides, you can take ANY study and trash it as he did.

Different years might make a difference? Sure, so the author correlates them by year. Ah, but he didn't combine dry years and rainy years, did he. So he does. Ah, but one of his rainy years was really a dry year with a hurricane that skewed the numbers. So he puts it with dry years. Ah, but it was wet by definition of # of inches of rain. So the author puts it with the wet years. Ah, but it was a dry year with a hurricane.

So the author drops the year in question.

AH HA! Now you are selecting data!!!!!! Bad bad bad.

Jerky, as you said.

Great study, good to use to see how your hike is lining up with those who made it to the end.

Maybe sometime someone can make up a similar one about re-supply, if academicians can get over the use of caches and hiker boxes.

**map man** · 02-16-2006, 00:42

(I'm planning to use this post as the one place where I answer questions about the procedures I used in my AT Hiking Rate study, as well as more detailed information about the resulting data [info that in my judgment might bog down the article], and my responses to suggestions from WhiteBlaze members with expertise in statistical methods.)

A couple of the questions so far have dealt with how I collected the raw data, so I will talk a little about that here. First, I did not have to read through every entry of every journal. I already mentioned in the article that I only bothered looking at the journals with at least 70 entries (the number of entries is included on the page that lists the journals for each year) and that eliminated most of the journals right away. When I did look at a journal, the first thing I did was skip forward to the last entries to see if the hike ended at Katahdin. If it did, I looked back at the first entry to see if it started at Springer. If it did, I quickly scrolled through the listing of dates for journal entries to see if there were any large gaps, and if there were, I looked at the entries on either side of the date gap to see whether trail was skipped or a section of the hike was not detailed thoroughly enough for my study. These quick steps eliminated a whole lot more journals. Only then did I go back to the journal start and begin tallying, day by day, things like zero days and the dates that landmarks were passed.

When doing this, if again the journal keeping was not thorough enough to reveal the info I wanted to collect, most of the time this revealed itself fairly quickly and I could move on to the next journal. Sometimes I would get a long way into the hike before the journal omitted info for a section and when this happened I would just have to shrug my shoulders and move on.

I kept two tally sheets for each journal. On one tally sheet I wrote down three things for each landmark a hiker reached: the date the landmark was reached, the number of days passed since the last landmark and the cumulative number of days for the whole hike up to that point. It looked something like this:

NAME OF HIKER ~~~ GEORGIA BORDER ~~~~ FONTANA etc. etc.
John Doe (March 1)......March 8 (8) [8]...............March 19 (11) [19]........
Jane Doe (March 12)....March 18 (6.7) [6.7]........March 25 (7.3) [14]......
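The whole-day bookkeeping on this first tally sheet can be sketched in Python. This is a guess at the counting rule based on the John Doe row, where the start day itself counts as day 1; the year 2005 is an arbitrary assumption, and fractional entries like Jane Doe's are not handled.

```python
from datetime import date

def section_tally(start, landmark_dates):
    """Return [(section_days, cumulative_days), ...] for whole-day tallies.

    Assumes the start day itself counts as day 1, which matches the
    John Doe row (March 1 to March 8 is tallied as 8 days).
    """
    rows = []
    previous = start
    cumulative = 0
    for i, reached in enumerate(landmark_dates):
        days = (reached - previous).days
        if i == 0:
            days += 1  # count the starting day in the first section
        cumulative += days
        rows.append((days, cumulative))
        previous = reached
    return rows

# John Doe from the example; the year 2005 is an arbitrary assumption.
john = section_tally(date(2005, 3, 1), [date(2005, 3, 8), date(2005, 3, 19)])
print(john)  # [(8, 8), (11, 19)]
```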

I would write very small so I could get an entire year's worth of hikers on one sheet of paper. On the second tally sheet I would keep track of zero days taken in each section. It would look something like this:

NAME OF HIKER ~~ GEORGIA BORDER ~~~~~~~ FONTANA
John Doe.................1,1.....(2) [0] {2}..................1,4,1...(2) [4] {6}.....
Jane Doe..........................(0) [0] {0}..................1........(1) [0] {1}.....

In this case John Doe took 2 one day breaks in the first section and then a one day, four day, and one day break in the next section. The numbers in the various brackets are, respectively: total days taken in short term breaks in that section, total days taken in long breaks in that section, and grand total of zero days for that section. I would only fill these bracketed numbers in when I had gotten to the end of that hiker's journal.
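The bracketed totals can be reproduced with a few lines of Python. The sketch uses the definitions given in this post (short term breaks last one or two days, long term breaks at least three straight days); the function name is just an illustrative choice.

```python
def zero_day_totals(breaks):
    """Split a section's breaks into (short-term, long-term, total) zero days.

    `breaks` lists the length in days of each break, e.g. [1, 4, 1].
    Breaks of one or two days count as short-term; three or more as long-term.
    """
    short = sum(b for b in breaks if b <= 2)
    long_term = sum(b for b in breaks if b >= 3)
    return short, long_term, short + long_term

print(zero_day_totals([1, 1]))     # John Doe, first section: (2, 0, 2)
print(zero_day_totals([1, 4, 1]))  # John Doe, second section: (2, 4, 6)
```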

Now for those who want more detail about the distributions of the data, here are some illustrations in the form of primitive histograms. In each case I set the bin boundaries that the data are divided into before tabulating the data to try to prevent bias. First, here's the distribution of the days taken to complete the AT. On the left are the ranges for number of days and the number in parentheses that follows is the number of hikers who fall in that range (in the illustrations that follow if there are any outliers that are not practical to illustrate, I list them without graphics on the tail of the data they belong in):

(The following five illustrations have been updated in February 2011 to include the 2001 through 2010 hiker classes.)

ILLUSTRATION 1 -- Days to Complete AT

080-089 (01): X
090-099 (01): X
100-109 (05): XXXXX
110-119 (07): XXXXXXX
120-129 (06): XXXXXX
130-139 (09): XXXXXXXXX
140-149 (23): XXXXXXXXXXXXXXXXXXXXXXX
150-159 (25): XXXXXXXXXXXXXXXXXXXXXXXXX
160-169 (39): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
170-179 (40): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
180-189 (28): XXXXXXXXXXXXXXXXXXXXXXXXXXXX
190-199 (31): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
200-209 (18): XXXXXXXXXXXXXXXXXX
210-219 (04): XXXX
220-229 (01): X
230-239 (01): X
282 (01)
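A text histogram in the style of Illustration 1 could be generated along these lines. This is a minimal sketch, not the method used for the study (the tallies there were done by hand); the `totals` values are invented, and the bin start, width, and count are fixed before the data are tabulated, as described above.

```python
def text_histogram(values, start, width, n_bins):
    """Return lines like '080-089 (01): X' for fixed-width bins of whole days."""
    counts = [0] * n_bins
    outliers = []
    for v in values:
        idx = (v - start) // width
        if 0 <= idx < n_bins:
            counts[idx] += 1
        else:
            outliers.append(v)  # listed separately, without graphics
    lines = []
    for i, c in enumerate(counts):
        lo = start + i * width
        hi = lo + width - 1
        lines.append(f"{lo:03d}-{hi:03d} ({c:02d}): {'X' * c}")
    lines += [f"{v} (01)" for v in sorted(outliers)]
    return lines

# Hypothetical whole-day totals, NOT data from the study.
totals = [85, 104, 107, 112, 118, 118, 282]
lines = text_histogram(totals, start=80, width=10, n_bins=4)
print("\n".join(lines))
```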

This is the distribution of the days actually spent hiking the AT (covering at least one tenth of a mile), excluding zero days:

ILLUSTRATION 2 -- Hiking Days to Complete AT

080-089 (01): X
090-099 (03): XXX
100-109 (09): XXXXXXXXX
110-119 (08): XXXXXXXX
120-129 (16): XXXXXXXXXXXXXXXX
130-139 (30): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
140-149 (50): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
150-159 (52): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XX
160-169 (39): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
170-179 (24): XXXXXXXXXXXXXXXXXXXXXXXX
180-189 (07): XXXXXXX
190-199 (01): X

This is the distribution of the total number of zero days taken during the course of hiking the AT:

ILLUSTRATION 3 -- Total Zero Days Taken

00-04 (09): XXXXXXXXX
05-09 (34): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
10-14 (33): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
15-19 (45): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
20-24 (49): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
25-29 (29): XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
30-34 (18): XXXXXXXXXXXXXXXXXX
35-39 (11): XXXXXXXXXXX
40-44 (06): XXXXXX
45-49 (00):
50-54 (02): XX
55-59 (00):
60-64 (01): X
65-69 (01): X
70-74 (01): X
122 (01)

Here's the distribution of days devoted to Short Term Breaks (zero days of only one or two days in duration):

ILLUSTRATION 4 -- Zero Days Taken in Short Term Breaks

00-01 (04): XXXX
02-03 (08): XXXXXXXX
04-05 (13): XXXXXXXXXXXXX
06-07 (23): XXXXXXXXXXXXXXXXXXXXXXX
08-09 (31): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
10-11 (32): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
12-13 (32): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
14-15 (27): XXXXXXXXXXXXXXXXXXXXXXXXXXX
16-17 (16): XXXXXXXXXXXXXXXX
18-19 (19): XXXXXXXXXXXXXXXXXXX
20-21 (18): XXXXXXXXXXXXXXXXXX
22-23 (06): XXXXXX
24-25 (06): XXXXXX
26-27 (01): X
28-29 (02): XX
30-31 (01): X
37 (01)

Here's the distribution of days devoted to Long Term Breaks (zero days of at least three straight days):

ILLUSTRATION 5 -- Zero Days Taken in Long Term Breaks

00-00 (66): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
03-04 (42): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
05-06 (30): XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
07-08 (19): XXXXXXXXXXXXXXXXXXX
09-10 (18): XXXXXXXXXXXXXXXXXX
11-12 (13): XXXXXXXXXXXXX
13-14 (13): XXXXXXXXXXXXX
15-16 (08): XXXXXXXX
17-18 (07): XXXXXXX
19-20 (04): XXXX
21-22 (03): XXX
23-24 (07): XXXXXXX
25-26 (02): XX
27-28 (01): X
29-30 (01): X
31-32 (01): X
40 (01)
43 (01)
49 (01)
61 (01)
103 (01)

Alligator suggested I might calculate the 10th and 90th percentile figures for the number of days to complete each section as one tool to try to judge whether the distribution of hiking times was screwy for this population of hikers (thorough journal keeping thru-hikers). I've calculated them and here is a table with the mean days to complete each section, the median, and the range of days taken to complete each section excluding the fastest and slowest 10 percent of hikers for those sections. In addition, I'm including the range of days to complete each section for the hypothetical 4 and 7 month hikes (a fast hike and a slow one) that are referenced in Table 4 of the article. I created these hypothetical hikes in the first place by the statistically crude method of a simple ratio (if a seven month hiker took 1.271 times longer to hike the whole hike than the mean of 168.4 days then that hiker would take 1.271 times longer to hike each section too). In Table 4 in the article I just give cumulative day totals rounded off to whole days but in this table I use section totals rounded to tenths. I'm including them here to show they're so darn similar to the 10th and 90th percentile figures:

(Tables A, B, C, and D are for the years 2001-2005 only; Table E is for 2001-2010)

Table A -- Range of Days to Complete Each Section

MEAN ~ MEDIAN ~MID 80% ~~~4#-7# ~~~~SECTION
7.95.........8.0........5.7-9.7..........5.8-10.1........Springer-Ga. Border
7.71.........7.7........5.7-9.7..........5.6-9.9..........Ga. Border-Fontana
24.34........24.........19-31...........17.7-31..........Fontana-Damascus
28.60........28........21.5-36.........20.8-36.5........Damascus-Waynesboro
11.32........11..........8-14............8.2-14.4........Waynesboro-Harpers
19.32.......18.5......14.3-23..........14.0-24.7.......Harpers-DWG
12.32.......12.5.......8.6-15...........9.0-15.7........DWG-Kent
23.11........23........18.3-29.........16.8-29.5........Kent-Glencliff
9.60..........9..........6.5-13...........6.9-12.2........Glencliff-Gorham
9.79.........9.9........7.3-12...........7.2-12.5........Gorham-Stratton
13.77........14.........11-16...........10.0-17.5.......Stratton-Katahdin
167.83......172.......137-197.........122-214.........For entire AT
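
The simple ratio method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scale_sections` is made up here, the 1.271 factor is the seven month figure quoted in the text, and only the first three section means from Table A are included.

```python
# Sketch of the "simple ratio" method: every section is stretched or
# shrunk by the same factor as the whole hike. Names are illustrative.

# Mean days per section, copied from Table A (first three sections only).
SECTION_MEANS = {
    "Springer-Ga. Border": 7.95,
    "Ga. Border-Fontana": 7.71,
    "Fontana-Damascus": 24.34,
}

def scale_sections(section_means, factor):
    """Return each section mean multiplied by factor, rounded to tenths."""
    return {name: round(days * factor, 1) for name, days in section_means.items()}

# A seven month hiker takes about 1.271 times the mean, per the text above.
seven_month = scale_sections(SECTION_MEANS, 1.271)
for name, days in seven_month.items():
    print(f"{name}: {days} days")
```

The results should land within about a tenth of a day of the seven month column in Table A; the exact factor and rounding used to build that column are not stated, so small differences are to be expected.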

Alligator felt that stating a realistic range for each section hiked, excluding the extremes on either end, might be a better way to state typical progress for TJKs in my study, as this might be more useful for hikers than just a single, simple flat number whether that number be the mean or the median (I hope I'm correctly stating his opinion, here). I'm thinking that hikers can get an idea of these kinds of realistic ranges by looking at the Table 4 hypothetical hike values for each section and seeing ranges very similar to those given in the 10th and 90th percentile calculation, but in a form that's more intuitive for them. If readers with concerns about statistical methods want more depth than I'm giving them in the article proper, I can always mention at the end of the article that some of that information is available in this thread which will be possible to access through the "Released Articles Forum," if the day comes when this article is accepted. This is an idea I'm entertaining anyway, and I'll be curious to see what others think.

ARambler and Alligator suggested I take a closer look at "Hiking Days," so here is a table similar to the one above, but for Hiking Days (excluding zero days) instead of Total Days. The table includes mean hiking days to complete each section, median, the entire range of days TJKs took for each section (least days and most), and the "Middle 80%" range figure excludes the ten percent of hikers taking the most and least days to hike:

Table B -- Range of Hiking Days to Complete Each Section

MEAN ~~MEDIAN ~~RANGE ~~~~~MID80%~~~SECTION
7.52.........(7.7)........(3.6-11.3)........(5.5-9)..........Springer-Ga. Border
7.36.........(7.3)........(3.3-10)..........(5.7-9.3)........Ga. Border-Fontana
21.17........(21)........(11.4-28).........(17-26)..........Fontana-Damascus
24.22........(25).........(15-33)...........(20-28).........Damascus-Waynesboro
9.61.........(10).........(5.9-15)..........(7.2-12).........Waynesboro-Harpers
16.04........(16)........(9.8-22.4)........(13-19)..........Harpers-DWG
10.62........(11).........(6-13.5)..........(8.3-13)........DWG-Kent
20.80........(21).........(11-28)..........(16.4-25)........Kent-Glencliff
8.62.........(8.5)........(4.2-12)...........(6-11)..........Glencliff-Gorham
8.81..........(9)..........(4.1-15)...........(7-11)..........Gorham-Stratton
12.87........(13)..........(7-20)...........(10.7-15).......Stratton-Katahdin
147.64......(150).......(85-199).........(121-171).......For entire AT

I need some time to digest a lot of what is in ARambler's second post before I respond to it, but Table A, above, might shed some light on a question he had in his first post. He wondered if the distribution of days to hike the last section spread out from previous sections as some hikers got VERY determined to get the hike over with while others wanted to linger so the experience wouldn't come to an end. I know what he means from reading people's journals. But to my surprise the range figure in Table A that shows the days to hike the last section seems to show the opposite. The percentage difference between the 10th percentile hiker and 90th percentile hiker in that section is smaller than in any other section of the AT.
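
The comparison above can be checked from the Mid 80% column of Table A. The following is a minimal sketch, assuming "percentage difference" means how much longer the 90th percentile hiker took relative to the 10th percentile hiker; the names `MID_80` and `percent_spread` are invented for the example.

```python
# Mid-80% ranges (10th and 90th percentile days) copied from Table A.
MID_80 = {
    "Springer-Ga. Border": (5.7, 9.7),
    "Ga. Border-Fontana": (5.7, 9.7),
    "Fontana-Damascus": (19, 31),
    "Damascus-Waynesboro": (21.5, 36),
    "Waynesboro-Harpers": (8, 14),
    "Harpers-DWG": (14.3, 23),
    "DWG-Kent": (8.6, 15),
    "Kent-Glencliff": (18.3, 29),
    "Glencliff-Gorham": (6.5, 13),
    "Gorham-Stratton": (7.3, 12),
    "Stratton-Katahdin": (11, 16),
}

def percent_spread(low, high):
    """Percent by which the 90th percentile time exceeds the 10th (assumed definition)."""
    return (high - low) / low * 100

spreads = {name: percent_spread(lo, hi) for name, (lo, hi) in MID_80.items()}
tightest = min(spreads, key=spreads.get)
print(tightest, round(spreads[tightest], 1))
```

Under that assumed definition the last section comes out with the smallest spread, which agrees with the observation in the paragraph above.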

LostInSpace asked about the relationship between start date and length of time to hike the AT. The following table breaks TJKs into five groups based on starting date, but unlike the distributions above, these dates I did tally before deciding how to group them. There were several breaks in the distribution of start dates on the calendar where it seemed natural to place boundaries defining the groups. It shouldn't be surprising that thru-hikers starting later in the season took less time to complete the trail given that with Baxter State Park closing in mid-October there was little choice. The quickness of the very earliest group, though, takes a little explaining. It's my belief, based on anecdotal evidence, that a lot of novice hikers cluster around the March 1 and March 15 dates for one reason or another, while many hikers leaving before this, a group knowing they have a lot of true winter hiking ahead of them, are more likely to be veteran hikers, and it seems likely that as a group veteran hikers might take less time to complete the trail than novice hikers. That's one idea, anyway. In Table C I give both the mean and median number of days for each group to complete the trail (the number of hikers in each group is given in parentheses after the date range):

Table C -- Time to Complete Grouped by Start Date

DATE RANGE ~~~~~~~~~~~ MEAN ~~~~~~~~ MEDIAN
Feb. 27 or before (13)..............165.0....................157
Feb. 28-March 5 (19)...............177.6....................177
March 6-March 17 (30).............170.0....................174
March 18-April 8 (29)...............164.7....................172
April 9 or after (14).................159.1....................166.5

Some asked about the relationship between number of days to complete the AT and zero days taken, specifically for the very fastest and slowest hikers. So I split the TJKs into six groups based on their percentile ranking for days to complete the AT: the fastest 10% of hikers, the hikers in the 70th to 90th percentile group, 50th to 70th, 30th to 50th, 10th to 30th, and the slowest 10% hikers. I looked at how long these groups took to complete, the number of hiking days (HDs), zero days, zero days taken in Short Term Breaks (STBs), and Long Term Breaks (LTBs). I also computed the percentage of the total days to complete that were taken in the three zero day categories:

Table D -- Relationship Between Time to Hike and Zero Days

PERCENTILE ~DAYS ~~HD's ~~ZERO DAYS ~~STB's ~~~~~LTB's
90-100...........113.9....106.2.....7.7 (6.8%).......6.5 (5.7%).....1.2 (1.1%)
70-90............147.9.....136.0.....11.9 (8.1%)......8.2 (5.6%).....3.7 (2.5%)
50-70............165.4.....144.9.....20.5 (12.4%)....13.9 (8.4%)....6.6 (4.0%)
30-50............175.5.....154.2.....21.3 (12.1%)....14.0 (8.0%)....7.3 (4.2%)
10-30............188.9.....165.5.....23.4 (12.4%)....14.6 (7.7%)....8.7 (4.6%)
0-10..............208.0....167.0.....41.0 (19.7%)....19.0 (9.1%)....22.0 (10.6%)

Table D shows that the difference between the very slowest group of thru-hikers and the next slowest bunch was all in the number of zero days they took. There's hardly any difference in hiking days at all. On the other hand, that big group of hikers ranging between the 10th and 70th percentiles were consistent about the percentage of zero days they took (between 12.1 and 12.4%), so the significant difference in this big block of mainstream thru-hikers tended to be the "miles per hiking day," and not frequency of zero days.
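
The percentages and gaps discussed above can be recomputed from the group means in Table D. This is a rough sketch only; `TABLE_D` and `zero_day_share` are names made up for the example, and because the inputs are already rounded group means, a recomputed percentage can differ from the table by a tenth.

```python
# Group means copied from Table D: (total days, hiking days, zero days).
TABLE_D = {
    "90-100": (113.9, 106.2, 7.7),
    "70-90": (147.9, 136.0, 11.9),
    "50-70": (165.4, 144.9, 20.5),
    "30-50": (175.5, 154.2, 21.3),
    "10-30": (188.9, 165.5, 23.4),
    "0-10": (208.0, 167.0, 41.0),
}

def zero_day_share(total_days, zero_days):
    """Zero days as a percentage of total days, rounded to tenths."""
    return round(zero_days / total_days * 100, 1)

for group, (total, hiking, zeros) in TABLE_D.items():
    print(f"{group:>6}: {zero_day_share(total, zeros)}% zero days")

# Gap between the slowest group and the next slowest group.
slowest, next_slowest = TABLE_D["0-10"], TABLE_D["10-30"]
hiking_gap = round(slowest[1] - next_slowest[1], 1)
zero_gap = round(slowest[2] - next_slowest[2], 1)
print(hiking_gap, zero_gap)
```

The two gap figures put numbers on the point made above: the slowest group differs from the next slowest by far more zero days than hiking days.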

Finally, here's a chart comparing the ten different years in the study. Keep in mind, the study population for any one year is pretty small, so I don't think the values for any given year are nearly as reliable as the total values for the whole range of years, 2001 to 2010 ("HD" in the following chart means Hiking Days):

Table E -- Hikes Grouped by Year of Hike

YEAR ~MEAN ~MEDIAN ~HD MEAN ~HD MED~ZERO MEAN ~ZERO MED
2001.....173.9.....170.0.......149.3........146.0..........24.6............24.0
2002.....175.1.....176.0.......153.7........152.0..........21.4............22.5
2003.....167.5.....168.0.......148.3........151.0..........19.2............21.0
2004.....169.9.....175.0.......149.4........150.0..........20.5............17.0
2005.....158.3.....163.5.......139.8........142.0..........18.6............17.0
2006.....169.7.....171.0.......148.3........149.0..........21.5............20.0
2007.....164.9.....164.5.......145.9........147.5..........19.0............16.5
2008.....170.3.....171.0.......146.5........151.0..........23.8............21.0
2009.....172.0.....175.0.......151.9........153.0..........20.2............20.0
2010.....172.3.....169.0.......150.9........153.0..........21.4............24.0
Total.....168.8.....171.0.......148.1........150.0..........20.7............19.0

If the numbers for 2005 are representative, and not a quirk of the small number of TJKs involved, my speculation from having read so many journals is that the monsoons that hit New England around the time August was turning into September, and that pretty much lasted through the rest of the hiking season, prevented some late season hikers from completing the AT who might have finished their thru-hike in a more typical year. These late season hikers are by nature going to tend to have a higher number of days to complete their hike, so since this group might be fewer than in a normal year, this might account for why 2005's average "number of days to complete" total could be lower than a typical year.

**LostInSpace** · 02-16-2006, 01:15

Do the data show the start date as a significant factor?

**Alligator** · 02-16-2006, 01:35

Originally Posted by ARambler

2) I like to use the hiking days data separately. Also, especially from a statistical point of view, you should not combine hiking and not-hiking without testing for independence. What does a simple plot of the number of zero days versus the number of days hiked look like? I would guess that the slope below the mean would be more than proportional, i.e. reducing the days hiked in half from 148 to 74 would reduce the zero days by more than from 20 to 10. (This would presumably be an extrapolation of the line.) However, at the upper end, the slope might be less than proportional. A hypothetical hiker doing 148+74 = 222 hiked days might have to reduce the number of zero days to complete the hike before winter. Therefore, Table 4 could be off a little.

Yes. Are the zero days and hiking days correlated? Probably at least weakly positive.

Originally Posted by ARambler

4) Skewed distributions: Thanks for the detailed mean versus median data. It is interesting to see how the mean catches up to the median. I guess the data show that the people who start very late, have to catch up relative to the median.
It does not surprise me that the total days distribution is skewed to the left and consequently the median is higher than the mean.

That doesn't mean it's skewed. A single observation can create the difference. Imagine a nice symmetrical distribution where the mean and median coincide. Now, take one observation just left of the median and move it to the far left. It won't alter the shape of the distribution to any great extent, yet it will drag down the mean.
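
A tiny Python sketch of that thought experiment, using made-up numbers rather than anything from the study:

```python
from statistics import mean, median

# A small symmetric sample: mean and median are both 6.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Move the observation just left of the median (the 5) far to the left.
shifted = [x for x in symmetric if x != 5] + [-50]

print(mean(symmetric), median(symmetric))
print(mean(shifted), median(shifted))
```

The median stays where it was while the mean drops, even though ten of the eleven observations have not moved.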

Originally Posted by ARambler

5) Outliers: I'm surprised outliers did not seem to be a concern to you. In 2005, Apple Pie left the trail in Erwin for about 50 days. This is over half of the LTB you report for the Fontana to Damascus section. Similarly, Stumpknocker took almost 365 days to hike the trail in 2004, but he hikes at a 4 month pace. (Hippy LS also did 360+ days but I don't think she had a complete TJournal. FB & Silver Girl took > 80 days off but their journal was not on Trailjournals.) I don't have much of a point to make about outliers, they are a part of life and a part of life on the AT. However, I think they are more a factor affecting zero days, and that's another reason for separating out zero days.

6) Variability:
a) By far the most common expression of variability is the standard deviation, or variance=std.dev. squared. It should be calculated using a spreadsheet or the standard deviation function on a programmable calculator. If you have to calculate it by hand:
Std.Dev^2 = [sum of each (day squared) - n*Ave^2]/(n-1) = [sum(Di*Di) - 2,957,525]/104; where Di = total days for each hiker, i, and 2,957,525 = 105 hikers*167.83*167.83 average days. For the Hiked Days it would be: Std.Dev^2=[sum(di*di) - 2,290,295]/104; Note, 105*147.69*147.69 = 2,290,295.
I hope you have the Excel function for sample standard deviation.
b) The easiest calculation for variability is range; just the longest minus shortest days. I believe one must assume a normal distribution to convert Range to (unbiased) variance (std.dev^2).

A rough guide is to divide the range by four to get an estimate of the population standard deviation. This is based on normality.
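
For anyone without a spreadsheet handy, both ideas can be sketched in Python. The hand formula is the one quoted above, [sum(Di*Di) - n*Ave^2]/(n-1), and the range rule is the divide-by-four guide. The function names and the sample data are invented for illustration; they are not the study's raw data.

```python
from statistics import mean, stdev

def stdev_by_hand(days):
    """Sample standard deviation via [sum(Di*Di) - n*Ave^2] / (n - 1)."""
    n = len(days)
    ave = mean(days)
    variance = (sum(d * d for d in days) - n * ave ** 2) / (n - 1)
    return variance ** 0.5

def stdev_from_range(days, divisor=4):
    """Rough guess at the standard deviation: range divided by four."""
    return (max(days) - min(days)) / divisor

# Made-up total days for a handful of hikers, for illustration only.
sample_days = [122, 150, 161, 168, 172, 175, 183, 190, 214]

print(round(stdev_by_hand(sample_days), 2))
print(round(stdev(sample_days), 2))
print(stdev_from_range(sample_days))
```

The first two printed values should agree, since `statistics.stdev` computes the same sample standard deviation; the range-based figure is only a rough guide and leans on the normality assumption.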

Originally Posted by ARambler

c) The other commonly used expression for variability is a confidence range. The most common range is a 95 % interval which for a normal distribution is about plus or minus 2 std. dev. from the mean. I recommend against using these confidence intervals with such skewed distributions. Note, the 95% confidence range means 2.5 % faster and 2.5 % slower.

No, a 95% confidence interval for the mean is saying that if you took a sample of size n repeatedly, 19 times out of 20 the mean would be in that confidence interval. It refers to where the mean lies, not what observations are in the tails. Are you missing a chromosome or what?

Originally Posted by ARambler

Since the normal assumption makes the estimates symmetric, the confidence interval is often expressed as Average +/- Interval/2, e.g. 168 +/- 30 days for a std. dev. of about 15 days.

A 95% confidence interval for the mean would be mean+/- 2(sigma/n^0.5). In words, the mean plus or minus 2 times the standard deviation divided by the square root of the sample size. The value sigma/n^0.5 is referred to as the standard error. It is entirely reasonable to assume normality, as the distribution of the sample mean converges to the normal with large sample sizes. In this case, it is 105. It hasn't been demonstrated what the distributions for the numbers of hiking days for the sections or the whole trail looks like. They may not be noticeably/significantly skewed.
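
As a worked example of that formula, here is a short Python sketch. The mean of 167.83 and n = 105 come from this thread; the 15 day standard deviation is only the hypothetical figure from the quoted post, since the actual sample standard deviation has not been posted. The function name is made up.

```python
from math import sqrt

def mean_confidence_interval(sample_mean, sample_sd, n, multiplier=2.0):
    """Approximate 95% interval for the mean: mean +/- 2 * sd / sqrt(n)."""
    standard_error = sample_sd / sqrt(n)
    half_width = multiplier * standard_error
    return sample_mean - half_width, sample_mean + half_width

# 167.83 and n = 105 come from the thread; the 15-day standard deviation
# is only the hypothetical figure from the quoted post, not a computed one.
low, high = mean_confidence_interval(167.83, 15.0, 105)
print(round(low, 1), round(high, 1))
```

With those inputs the interval is only a few days wide, far narrower than +/- 30 days, which is the difference the division by the square root of n makes.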

Originally Posted by ARambler

d) You could report the actual % interval as a pseudo-confidence interval. Just figure out the number of days that 2.6 hikers were slower and another number which 2.6 hikers were faster. What has been proposed is reporting the lower and upper numbers of -10% and +90 %. For the section data, I think you will find it difficult to interpolate between whole days for the -10%/+90% number, which in my mind is an arbitrary, non-standard percentile, pseudo-confidence interval.

It is arbitrary. It is the idea behind a trimmed mean, where the influence of outliers has been removed. If one were to take the 10th, 50th, and 90th percentiles in each section, the slowest and fastest hikers would be excluded, the median would serve as the measure of central tendency, and no distributional assumptions would be violated. 5, 50, and 95 could be used also.
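
A minimal sketch of pulling the 10th, 50th, and 90th percentiles with Python's standard library. The data are made up, `percentile_summary` is an invented name, and the "inclusive" interpolation method is just one of several reasonable conventions.

```python
from statistics import quantiles

def percentile_summary(days):
    """Return the (10th, 50th, 90th) percentiles of a list of day counts."""
    # n=10 yields nine cut points: the 10th, 20th, ..., 90th percentiles.
    cuts = quantiles(days, n=10, method="inclusive")
    return cuts[0], cuts[4], cuts[8]

# Made-up section times in days, including one very slow outlier.
section_days = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 9.5, 21.0]

p10, p50, p90 = percentile_summary(section_days)
print(p10, p50, p90)
```

The 21 day outlier has no effect on any of the three reported values, which is the appeal of this kind of summary.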

Originally Posted by ARambler

e) If you remove the variability associated with the zero days, you might be able to give a good representation for hiking variability just by reporting aggregate data. This data might also be easiest to understand and use in a statistic free way. I propose to aggregate the data for each section into five groups. Because the distances vary by such a large amount, the intervals for the groupings should also vary. I suggest that each of the five groups vary by m=1, 2, or 3 days. You would then report 8 values/section: g1.days, m, n.g1, n.g2, n.g3, n.g4, n.g5, Slow. I'm not sure whether the g1.days should be integer and the start of the interval. Assuming that it is, you would get numbers like:
5, 1, 12, 21, 32, 19, 11, 2. For the first section, 12 hikers would reach the GA line in 5.0 to 5.9 days, 21 hikers would reach the border in 6.0 to 6.9 days, 32 hikers in 7.0 to 7.9 days, 19 hikers in 8.0 to 8.9 days, 11 hikers in 9.0 to 9.9 days and 2 hikers over 9.9 days (optional). By calculation, 105-(12+21+32+19+11+2)=8 hikers less than 5.0 days. The relative distribution for the Damascus to Waynesboro section will not be exactly the same, but if it was, the data would be reported as 17, 3, 12, 21, 32, 19, 11, 2, and the groupings would be: 17 to 19.9 days, 20 to 22.9 days, 23 to 25.9 days, 26 to 28.9 days, and 29 to 31.9 days. Slow hikers would look at this raw data and see 11 in 105 needed 9.0 to 9.9 days of food to reach the GA border and 29 to 31.9 days to get to Waynesboro, and would plan on packing this amount. (Hopefully, not all at once.)

What you are describing are the bins of a histogram. A histogram would be easier to follow. A boxplot for each section of the number of hiking days it took would be even better.
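
A text histogram in the style of the illustrations earlier in the thread takes only a few lines of Python. This is a sketch with made-up data; `text_histogram` and its `start`, `width`, and `bins` parameters are invented for the example, and values outside the five bins are simply left out.

```python
def text_histogram(days, start, width, bins=5):
    """Count values into equal-width bins and return (label, count) pairs."""
    counts = [0] * bins
    for d in days:
        index = int((d - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    rows = []
    for i, count in enumerate(counts):
        low = start + i * width
        rows.append((f"{low:g}-{low + width - 0.1:g}", count))
    return rows

# Made-up days to reach the Georgia border, for illustration only.
days_to_border = [4.5, 5.2, 5.8, 6.1, 6.4, 6.9, 7.0, 7.3, 7.7, 8.2, 8.8, 9.4, 10.5]

for label, count in text_histogram(days_to_border, start=5, width=1):
    print(f"{label} ({count:02d}): {'X' * count}")
```

Changing `width` to 2 or 3 gives the wider groupings suggested for the longer sections.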

**Alligator** · 02-16-2006, 01:57

Originally Posted by domnokmis

I thought he was being pretty jerky, myself. And I'm not in academia, so you are obviously correct. Perhaps if I were more insulated from the practical, I could be more picky.

I'd consider your having your head surrounded by your ass to be well insulated from the practical. How's that for jerky?

Originally Posted by domnokmis

But as far as I can tell, he applied the study to something the author did not extend it to, then said, it can't be used for this purpose. Duh.

I didn't apply the study to anything. I expressed reservations about where it could be applied. Duh.

Originally Posted by domnokmis

Besides, you can take ANY study and trash it as he did.

No, I didn't trash it. I suggested that there could be bias present due to the way the sample was selected.

Originally Posted by domnokmis

Different years might make a difference? Sure, so the author correlates them by year. Ah, but he didn't combine dry years and rainy years, did he? So he does. Ah, but one of his rainy years was really a dry year with a hurricane that skewed the numbers. So he puts it with dry years. Ah, but it was wet by definition of # of inches of rain. So the author puts it with the wet years. Ah, but it was a dry year with a hurricane.

So the author drops the year in question.

AH HA! Now you are selecting data!!!!!! Bad bad bad.
...

No, what the author could try is to see if the means by year are similar or dissimilar. If they are dissimilar, this would certainly suggest that grouping was inappropriate. But hey, you know what, time is so static and such an unimportant factor in science that you might as well just throw it out. I mean nothing ever changes with time right? I'm sure that 3 weeks of rain, a heavy March snow, and unseasonably warm temperatures have absolutely no effect on a hiker's progress. And these things happen year in and year out exactly the same too.

**Alligator** · 02-16-2006, 09:44

BTW domnokmis, is it Frosty I'm speaking with or Mrs. Frosty?

**ARambler** · 02-16-2006, 16:16

Map man:
Thanks for your analysis and follow-up.
1) I take it from your methodology that you have not calculated the actual hiked days/section for any individual hiker. You have just subtracted the total zero days from the total days to get Hiked days. So, you will need to start from your raw data to analyze the non zero day hiking rates. I think it would still be useful to plot zero days versus total days (including zeros). Not sure how to present it in this forum.

2) The easiest way to get a picture of non-zero data is to look at the non-zero hiking day Range for each section. You would need to go down both tables and subtract the zero days from the total days and compare that to the interim min and max for that section. Note, you already have the average number of non zero days for each section (my earlier post recalculated it from your %zero and mean data).

3) I found your Mid 80% data very interesting. I assume you are saying 10 to 11 hikers were below this range and 10 to 11 hikers above this range. For the entire AT, the lower bound, 137, is 31 days below the original mean and the upper bound, 197, is 29 days above the mean. I don't know how to statistically test this, but the distribution (without the 10% tails) seems only slightly skewed (about the original mean). Also, the sum of the section ranges gives an "upper and lower time" of 125.9 and 208.4 days. So, the same 21 hikers were not in the tails for all of the sections. (What I would expect.)
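
The section sums in point 3 can be reproduced from the Mid 80% column of Table A. A minimal Python sketch (the variable names are made up):

```python
# (10th percentile, 90th percentile) days per section, copied from Table A.
MID_80_BY_SECTION = [
    (5.7, 9.7), (5.7, 9.7), (19, 31), (21.5, 36), (8, 14), (14.3, 23),
    (8.6, 15), (18.3, 29), (6.5, 13), (7.3, 12), (11, 16),
]

sum_low = round(sum(low for low, _ in MID_80_BY_SECTION), 1)
sum_high = round(sum(high for _, high in MID_80_BY_SECTION), 1)

# Whole-trail mid-80% range from the last row of Table A.
TRAIL_LOW, TRAIL_HIGH = 137, 197

print(sum_low, sum_high)
print(round(TRAIL_LOW - sum_low, 1), round(sum_high - TRAIL_HIGH, 1))
```

The whole-trail range sits inside the summed section ranges on both ends, which is what one would see if different hikers were fast or slow in different sections.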

Alligator
Thanks for your comments.
1) I'm generally reluctant to trim the data to reduce outliers. In this case, I think outliers may be created by long stretches of zero days. So, although it is arbitrary and I don't know how to handle it statistically, your mid 80% range seems good to me.

2) If map man can provide the total range per section for the non-zero day data, I will be more willing to assume a normal distribution. What tests would you want to do before you make that assumption? If we are at least tentatively willing to assume the data is normal, I think we would divide the range by about 5 (for a large set of 105 data) to get an estimated standard deviation.

3) I'm sure I used the term "confidence interval" in a non standard manner. There are only one or two sample sizes of interest, 105 for the Trail journals data and possibly 1 for the reader's expected value. I'm not sure what you mean by a sample size, n.

4) The following quote is wrong but I'm sure you misunderstood. Statisticians must always be careful about confusing cause and effect. I said I guessed the total distribution was skewed to the left, e.g. more hikers would complete the trail in less than 137 days (31 less than the mean), than would hikers complete the trail in 199 days which is 31 days greater than the mean. This seems true based on map man's data, but I'm not so sure it's statistically significant. Nonetheless, for all distributions that I have seen, distributions skewed to the left have a mean that is less than the median. Do you have any real life counter examples? What criteria do you use to conclude the parent distribution is skewed?

Originally Posted by Alligator

That doesn't mean it's skewed. A single observation can create the difference. Imagine a nice symmetrical distribution where the mean and median coincide. Now, take one observation just left of the median and move it to the far left. It won't alter the shape of the distribution to any great extent, yet it will drag down the mean.

...

What you are describing are the bins of a histogram. A histogram would be easier to follow. A boxplot for each section of the number of hiking days it took would be even better.

Thanks for the support on Histogram data.
Rambler

**Alligator** · 02-16-2006, 18:43

Originally Posted by ARambler

Map man:
Thanks for your analysis and follow-up.
1) I take it from your methodology that you have not calculated the actual hiked days/section for any individual hiker. You have just subtracted the total zero days from the total days to get Hiked days. So, you will need to start from your raw data to analyze the non zero day hiking rates. I think it would still be useful to plot zero days versus total days (including zeros). Not sure how to present it in this forum.

I too would seek to remove the zero days. Personally, I would hypothesize that zero days are a function of days hiked plus some highly random amount. I think it could be hard to model zero days because not only are they used for rest, washing clothes, and resupply, but for *** occurrences, nice camp areas, pink blazing, etc.

Originally Posted by ARambler

Alligator
Thanks for your comments.
1) I'm generally reluctant to trim the data to reduce outliers. In this case, I think outliers may be created by long stretches of zero days. So, although it is arbitrary and I don't know how to handle it statistically, your mid 80% range seems good to me.

I don't like to take out outliers either, I prefer to accommodate them. If I felt like giving myself a headache, I might try some M-estimators or other robust techniques.

Originally Posted by ARambler

2) If map man can provide the total range per section for the non-zero day data, I will be more willing to assume a normal distribution. What tests would you want to do before you make that assumption? If we are at least tentatively willing to assume the data is normal, I think we would divide the range by about 5 (for a large set of 105 data) to get an estimated standard deviation.

I would be interested in comparing the empirical data to theoretical distributions through Q-Q plots. If the data is compared with a normal distribution, based on the shape of the plot, skewness and heavy or light tails can be demonstrated. Examination of boxplots will also suggest distribution. Shapiro-Wilk, Anderson-Darling, Lilliefors, and other goodness-of-fit tests will give a numerical answer. Generally, many of the statistical procedures I use are robust to departures from normality. When examining residuals, I'm generally satisfied with "good" adherence to a normal probability plot. A lot of the numerical tests are not perfect anyway.
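
For a rough idea of what goes into a normal Q-Q plot, here is a standard-library Python sketch that pairs sorted observations with standard normal quantiles and reports how close to a straight line they fall. The function names, the (i + 0.5)/n plotting position, and the data are all assumptions made for illustration; this is not a substitute for the formal tests named above.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Pair each sorted value with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    # (i + 0.5) / n is one common choice of plotting position.
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def qq_correlation(values):
    """Correlation between theoretical and observed quantiles (near 1 if normal)."""
    points = normal_qq_points(values)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    return cov / ((len(points) - 1) * stdev(xs) * stdev(ys))

# Made-up hiking-day totals, for illustration only.
hiking_days = [121, 133, 139, 142, 146, 148, 150, 151, 155, 158, 163, 171, 199]
print(round(qq_correlation(hiking_days), 3))
```

Plotting the pairs from `normal_qq_points` gives the Q-Q plot itself; points bending away from a straight line at one end suggest skew or a heavy tail on that side.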

For whatever group you wish, if you want an estimate of the standard deviation, just use s, the sample standard deviation. I only mentioned the range/4 method because you had mentioned equating the range to the s.d.

Originally Posted by ARambler

3) I'm sure I used the term "confidence interval" in a non standard manner. There are only one or two sample sizes of interest, 105 for the Trail journals data and possibly 1 for the reader's expected value. I'm not sure what you mean by a sample size, n.

I was using the general formula for computing a confidence interval for the mean. You stated the confidence interval as approximately +/-2 times the s.d. Follow this link. http://davidmlane.com/hyperstat/B7483.html
What you left out is the division by the square root of n, the sample size. Here n=105.

Originally Posted by ARambler

4) The following quote is wrong but I'm sure you misunderstood. Statisticians must always be careful about confusing cause and effect. I said I guessed the total distribution was skewed to the left, e.g. more hikers would complete the trail in less than 137 days (31 less than the mean), than would hikers complete the trail in 199 days which is 31 days greater than the mean. This seems true based on map man's data, but I'm not so sure it's statistically significant. Nonetheless, for all distributions that I have seen, distributions skewed to the left have a mean that is less than the median. Do you have any real life counter examples? What criteria do you use to conclude the parent distribution is skewed?

I agree with you that a skewed left distribution will have a mean less than the median. The converse is not necessarily true for sample data. Having a mean less than the median does not always mean skewness. Due to random variation, it is entirely possible to have a normal distribution where the mean and median do not coincide. Even more so if you are measuring in discrete units like days and taking means down to the tenths or hundredths place. In fact, I entirely expect the mean and median to not be exactly the same in a sample. In Map Man's data, four of the sections have means less than the median, the other seven above. The overall mean is less than the median.

Originally Posted by ARambler

What criteria do you use to conclude the parent distribution is skewed?

Compare in a Q-Q plot to a standard normal, as stated previously. Probably compare it to any theoretical symmetric distribution and it ought to show skewness. The way the plot deviates will tell you how the distribution is different: left and right skew, heavy and light tails. Honestly, I always have to look up the shapes as I forget which is which. Alternatively, guess the parameters correctly for a hypothesized skewed distribution, and look for a match on the Q-Q plot. That's much harder though.

Originally Posted by ARambler

Do you have any real life counter examples?

See attached. The data are red maple tree heights from the same plot. I mostly need to check residuals for normality, so I didn't have anything specific handy. This was however, the second data set I pulled. The data are normal by Anderson-Darling goodness of fit, but the mean is 69.49 and the median 70. The slight variations between the mean and median can be more pronounced in smaller data sets.

...

I too would seek to remove the zero days. Personally, I would hypothesize that zero days are a function of days hiked plus some highly random amount. I think it could be hard to model zero days because not only are they used for rest, washing clothes, and resupply, but for *** occurrences, nice camp areas, pink blazing, weddings, funerals, etc.

Sorry to keep you in the crosshairs there Map Man. I would be happy to look at the hiking days per section in order to answer some of my own questions. What I would be interested in doing is to plot the number of hiking days by section and compare the distributions across years and across start dates. Really just a visual check. If you want that is, it's your data.

**map man** · 02-16-2006, 23:43

From this point forward I'm planning to post the detailed responses I have to members' questions and suggestions that deal with further elaborating the data or how better to employ statistical methods for this study at Post #28 of this thread. But I still want to hear from and respond to anybody who has anything to say about the article at all.

And speaking of responses, although I've been trying to address people who have specific suggestions to improve the article, there's a long list of people who've been letting me know that they enjoyed reading the study, so I want right now to say, "thank you."

**Jack Tarlin** · 02-16-2006, 23:54

I think this is all fascinating stuff; thanks for working on this and sharing it with us.

**map man** · 02-19-2006, 21:02

I've updated post #28 of this thread by editing into it some new info that some WhiteBlaze members requested.

**ARambler** · 02-20-2006, 00:00

Thanks for the continued updates.
Rambler

**FiveWay** · 02-20-2006, 23:22

Great work. Thanks for taking the time. As I finish my prep work for my 06 Thru-hike this helps me to look at what I had planned and what to expect. FiveWay

**map man** · 03-20-2006, 23:34

I'm curious now that I've made several edits and additions to the main article over the last few weeks whether people think the article "works." Is it useful? Is the information presented in as clear a fashion as possible? Is it accessible -- that is, does it avoid being too complicated and avoid only appealing to statistics nerds like myself? I've arrived at the idea of referring readers to the article thread (specifically, post #28) for a more in-depth discussion of the data. Does that seem like a good idea? For those who've brought up issues concerning the statistical methods used, have your concerns been addressed in either the article or post #28? And finally, does the article in its present form meet the standards that an article here at WhiteBlaze ought to meet?

Now that the weather is nicer here in my neck of the woods (Iowa), I'm out hiking on weekends instead of messing around with the raw data used in the study like I was this winter, but I'd still like to hear from people if you have any suggestions at all to make the article better.
