
Identification of Inhomogeneous/Short Sequences in the Chronic Hepatitis Dataset

Based on the Variance of Sampling Interval

Shoji Hirano and Shusaku Tsumoto

Department of Medical Informatics, Shimane University, School of Medicine

89-1 Enya-cho, Izumo, Shimane 693-8501, Japan


Abstract. This paper presents a criterion for identifying inhomogeneous sequences in time-series medical datasets based on the variance of the sampling interval, the duration of data acquisition, and the number of data points. After removing inhomogeneous and short sequences from the dataset, we analyzed the hepatitis dataset again and obtained the following findings regarding the stage of liver fibrosis and the speed of platelet (PLT) decrease: (1) in F2 cases, PLT reaches an abnormally low level faster than in F3 cases; (2) in type C hepatitis patients who did not receive interferon (IFN) treatment, PLT reaches an abnormally low level more slowly than in those who received IFN treatment, for all fibrosis stages.

1 Introduction

One of the remarkable characteristics of longitudinal time-series medical data is that the granularity of data points can vary over time. There are two main reasons: (1) the dates of laboratory examinations can be flexibly arranged according to the condition of the patient, and (2) examinations should be optimized in order to minimize medical costs. Hence the sampling interval of the data is irregular. Interestingly, however, the irregularity itself conveys potentially useful information that enables us to infer the patient's condition. For example, dense examinations may reflect a severe condition, and sparse examinations may reflect long-term observation after the end of medical treatment.

Re-sampling is a common and important procedure in time-series analysis because it provides a way of making irregularly sampled, inhomogeneous data uniform. In time-series medical data analysis, however, it should be done carefully, after confirming that important information in the original data will not be lost or degraded by the re-sampling. Suppose we have examination data of a patient acquired every week for one year, and another set of data for the same patient, sampled in the same way, acquired five years later. If we simply connect the two, construct 1+5+1=7 years of data, and then perform re-sampling, important information hidden in the dense examination periods of the first and last years may be degraded by the large, flat population of data points during the intermediate five years. Hence, one should carefully observe the data and determine whether they can simply be connected and interpolated or should be discarded because of excessive inhomogeneity. Such an observation should be performed prior to the analysis; however, it is difficult if the amount of data is large. Therefore, a system that alerts the user to the existence of inhomogeneous data and facilitates re-examination of the data is required.
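The 1+5+1 year example above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the examination dates and values are synthetic, a year is taken as 52 weeks, and plain linear interpolation (`numpy.interp`) stands in for whatever re-sampling scheme is actually used. It counts how many of the re-sampled points are interpolated rather than observed.

```python
import numpy as np

WEEKS_PER_YEAR = 52  # assumption used only for this illustration

# Hypothetical examination dates (weeks): weekly during the first year,
# nothing for five years, then weekly again during the seventh year.
first_year = np.arange(0, WEEKS_PER_YEAR)
last_year = np.arange(6 * WEEKS_PER_YEAR, 7 * WEEKS_PER_YEAR)
t = np.concatenate([first_year, last_year])

# Synthetic values: noisy around 200 in the first year, around 120 in the last.
rng = np.random.default_rng(0)
values = np.concatenate([
    200 + 10 * rng.standard_normal(first_year.size),
    120 + 10 * rng.standard_normal(last_year.size),
])

# Re-sample the connected sequence on a regular one-week grid.
grid = np.arange(t[0], t[-1] + 1)
resampled = np.interp(grid, t, values)

# Grid points without a real observation were produced by interpolation.
observed = np.isin(grid, t)
synthetic_fraction = 1.0 - observed.mean()
print(f"{grid.size} re-sampled points, {synthetic_fraction:.0%} interpolated")
```

In this synthetic case most of the re-sampled sequence is a single straight line across the gap, which is the kind of dilution of the two dense periods that the text warns about.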

Our research aims at developing a data mining system that supports knowledge discovery from inhomogeneous time-series medical data. As the core of the system, we developed a hybrid technique of multiscale matching [1,2] and rough clustering [3] for comparison, grouping, and visualization of time series. We then applied the method to the chronic hepatitis dataset and discovered some interesting patterns in GPT sequences that may represent the effectiveness of the interferon treatment, as well as interesting relations between the speed of platelet count decrease and the stage of liver fibrosis [4]. However, the analysis also revealed that the dataset contains a considerable number of short and inhomogeneous sequences, which decreased the separability of clusters and corrupted the statistical analysis performed on the clustered sequences.

This paper presents a novel method of discriminating short and inhomogeneous sequences based on the variance of the sampling interval. Refined results of data analysis, which investigate the relationships between the speed of platelet decrease and the stage of liver fibrosis using the cleaned dataset, are also presented.

2 Identification of inhomogeneous and short sequences

2.1 Results of previous analysis and observed problems

During the last two years, we have investigated the relationships between the speed of platelet count decrease and the stage of liver fibrosis, aiming to determine whether the invasive liver biopsy can be substituted by standard routine blood tests. The approach included clustering and visualization of the platelet sequences, and statistical analysis of the clustered sequences. First, we summarize the results and problems found up to last year.

The subject was the hematological data added to the hepatitis dataset in October 2002. This hematological data included eight examination items: red blood cell count, white blood cell count, platelet count, hemoglobin, hematocrit, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration. From this data we extracted the platelet count (PLT) sequences and constructed a time-series dataset of PLT.

We removed 222 sequences from a total of 720 sequences in the dataset because no biopsy information was available for them. The remaining 498 sequences were stratified into three groups according to the virus type (B or C) and administration of the interferon (IFN) treatment. To each of the three groups we applied multiscale matching and rough clustering, and visually inspected the resultant clusters.

Fig. 1. An example of a sequence in which PLT counts began to increase after IFN treatment. (Plots of sequences #945 (F2) and #592 (F1); labels IFN, Norm H, Norm L, and Biopsy; horizontal axis in years, 0 to 10.)

For the group of type C hepatitis with interferon treatment, some interesting clusters of PLT sequences taking similar temporal courses were generated. Figures 1 and 2 show representative examples of sequences in those clusters. The sequences in Figure 1 presented a common pattern in which the PLT count began to increase after the end of the IFN treatment. It may represent the case in which liver function was recovered by the treatment, so that the PLT count stayed within the normal range. In contrast, the sequences in Figure 2 presented a common, long-term pattern in which the PLT count continued to decrease even after the end of the IFN treatment. These patterns suggested the hypothesis that, if IFN treatment is effective, the PLT count stays within the normal range, and if it is not effective, the PLT count continues to decrease and the patient finally faces the risk of continuous bleeding.

We then focused on the cases in which IFN treatment was ineffective, and further investigated the relationships between the weeks spent reaching an abnormally low PLT count and the degree of liver fibrosis. During this process, it became apparent that the short and inhomogeneous sequences corrupted the statistics. Therefore, we manually selected sequences that were considered to sufficiently represent the natural course of chronic hepatitis and recalculated the statistics. The results were F1: 0-15 years, F2: 0-10 years, F3 and F4: 0-8 years, which almost followed the natural course of type C chronic viral hepatitis without IFN treatment. This suggested that (1) if IFN treatment was ineffective, the patient takes a clinical course similar to the natural one and the PLT count continues to decrease, and (2) the further fibrosis proceeds, the faster the PLT count reaches an abnormally low level.

In order to validate these findings, we performed a statistical analysis using all data. Table 1 shows the results. For both type B and type C cases, it demonstrated that the order of PLT counts followed that of the fibrotic stages. For cases in which information about virus activity was available, we further examined the relationships between fibrotic stage, virus activity, and years for reaching abnormally low PLT counts. Table 2 provides the results. It demonstrated that the further fibrosis proceeds, the faster the PLT count reaches an abnormally low level. Moreover, for the same fibrotic stage, the higher the virus activity, the faster the PLT count reaches an abnormally low level.

Fig. 2. An example of a sequence in which PLT counts continued to decrease after IFN treatment. (Plots of sequences #743 (F3) and #582 (F1).)

The above results suggested that the platelet count may be used as a measure for predicting fibrotic stages and the long-term course after IFN treatment. However, this analysis also revealed the following problems.

1. Short-term decrease of PLT count induced by IFN treatment

If IFN treatment induces myelosuppression, the ability to generate new blood cells is inhibited, resulting in a decrease of blood cell counts. In the hepatitis dataset, we observed many sequences that included a short-term decrease of PLT count whose length exactly matched the period of IFN treatment (about 6 months). This short-term decrease could lead to imprecise detection of the marker point at which the PLT count becomes lower than the normal level. Figure 3 shows an example of such a case. In this case, the short-term decrease caused the detection of a pseudo marker point two years earlier than the expected point.

2. Pseudo trend caused by the interpolation of long-term missing values

Examination data about chronic diseases can be acquired over 10-20 years. Even if there is a continuous hiatus in the data for several years, one can recognize the long-term trend by visually interpolating the missing values from the available data. However, such long-term interpolation involves a risk of introducing a pseudo trend. In particular, if the interpolated section intersects the lower boundary of the normal PLT range, the credibility of the location of the marker point is substantially degraded. Figure 4 shows an example of this kind of inhomogeneous sequence.

Table 1. Fibrotic stage and PLT count

Type  N   Fibrotic stage  Mean PLT count  SD PLT count
B     61  F1              206.76          51.79
B     51  F2              173.45          44.40
B     25  F3              163.84          45.91
B     22  F4              114.73          50.10
C     21  F0              232.76          63.48
C     38  F1              186.83          54.19
C     81  F2              150.85          47.58
C     67  F3              137.45          44.41
C     62  F4              123.76          45.00

Table 2. Fibrotic stages, virus activity, and weeks for reaching abnormally low PLT count

Fibrotic stage  Virus activity  N   Mean  SD
F1              A2              6   3.36  3.95
F2              A1              1   -     -
F2              A2              16  2.87  3.25
F3              A1              1   -     -
F3              A2              5   3.67  3.59
F3              A3              9   0.68  1.02
F4              A1              1   -     -
F4              A2              15  0.88  2.27
F4              A3              9   0.08  0.17

2.2 Identification of inhomogeneous and short sequences

As mentioned in the introduction, temporal irregularity is one of the remarkable features of longitudinal time-series medical data. It makes data analysis more difficult than in the case of regularly sampled, equal-length data. For example, missing values may generate a pseudo trend, and non-uniformity of the data makes it difficult to discriminate between a short but dense sequence and a long but sparse sequence using sequence length alone.

The temporal granularity of data required for analysis depends on the type of disease and the length of events. Based on the observations in the previous section, we attempted to build criteria for identifying inadequately short or inhomogeneous sequences in the context of chronic hepatitis dataset analysis. We focused on the following three conditions.

Fig. 3. Short-term decrease of PLT count induced by IFN treatment. (Plot of sequence #498; labels Norm H, Norm L, and IFN; horizontal axis in years, 0 to 15.)

Fig. 4. Pseudo trend caused by the interpolation of long-term missing values. (Horizontal axis in years, 0 to 15.)

I Duration of data acquisition:

Exclude short sequences inadequate for chronic disease analysis. We excluded a sequence if the duration between the first and last examinations was shorter than five years.

II Interval of data acquisition:

Exclude inhomogeneous sequences in which one or more intervals of data acquisition are distinctively long. Focusing on the variance of the data acquisition intervals, we excluded a sequence if the mean + SD of its sampling intervals was longer than three years.

III Number of data points:

Exclude short sequences containing only a few data points. We excluded a sequence if the number of data points was less than three.
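The three conditions can be sketched as a single check on the examination dates of one sequence. This is a minimal illustration, not the authors' implementation: the function name `classify_sequence`, the 52 weeks per year conversion, and the use of the population SD (`numpy.std` with its default `ddof=0`) are assumptions. The point count is tested first, since interval statistics on very few points are not meaningful.

```python
import numpy as np

WEEKS_PER_YEAR = 52.0  # assumption: the conversion is not stated in the text


def classify_sequence(exam_weeks, min_points=3, min_duration_years=5.0,
                      max_mean_plus_sd_years=3.0):
    """Label one sequence as 'few points', 'short', 'inhomogeneous' or 'ok'.

    exam_weeks holds the examination dates in weeks (hypothetical input format).
    """
    t = np.sort(np.asarray(exam_weeks, dtype=float))
    # Condition III: too few data points.
    if t.size < min_points:
        return "few points"
    # Condition I: duration between first and last examination too short.
    if (t[-1] - t[0]) / WEEKS_PER_YEAR < min_duration_years:
        return "short"
    # Condition II: mean + SD of the sampling intervals too long.
    intervals = np.diff(t)
    measure = intervals.mean() + intervals.std()  # population SD (assumption)
    if measure / WEEKS_PER_YEAR > max_mean_plus_sd_years:
        return "inhomogeneous"
    return "ok"


# Hypothetical schedules, in weeks since the first examination.
print(classify_sequence([0, 10]))                        # two points only
print(classify_sequence(np.arange(0, 100, 4)))           # under two years of data
print(classify_sequence([0, 1, 2, 500, 501, 502]))       # dense bursts, long gap
print(classify_sequence(np.arange(0, 520, 4)))           # regular, ten years
```

The thresholds are exposed as parameters because, as the following paragraphs discuss, their values were chosen from the distributions observed in this particular dataset.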

The value of five years in condition I is rather intuitive and is not based on any statistical evidence or medical constraints. Figure 5 shows the distribution of the duration of PLT data acquisition. Although there are two large peaks at one and ten years, no other remarkable peaks exist. Choosing ten years as the threshold value may involve the risk of missing the fast decrease of PLT count in a patient of a higher fibrotic stage. Therefore, we chose five years as the threshold, just before the small minimum at four years.

The value of three years in condition II was chosen for the following reasons. Figures 6 and 7 show the histograms of the mean and SD, respectively, of the sampling intervals for all patients. According to Figure 6, most of the sampling intervals were less than 20 weeks (5 months), and the median was about six weeks (1.5 months). A small peak existed at about 100 weeks (2 years); this seemed to represent cases in which patients stopped coming to the hospital and came back several years later for treatment of some health problem, or cases in which patients received examinations every 1-2 years. Because the contiguity of the data points was very low in these cases, they should be removed from the analysis.

Fig. 5. Histogram of the duration of PLT data acquisition. (Horizontal axis: Duration of Data Acquisition (Years); vertical axis: Number of Patients.)

Fig. 6. Histogram of the mean value of PLT data acquisition intervals. Ave=11.56, SD=23.79, Median=6.24 (weeks). (Horizontal axis: Intervals of Data Acquisition (Average; Weeks); vertical axis: Number of Patients.)

A similar distribution was observed in the histogram shown in Figure 7. A remarkable feature was that 21 cases (3%) exceeded 100 weeks (2 years), and 6 of them exceeded 200 weeks (4 years). These corresponded to cases in which dense examinations at intervals of 1-2 weeks were performed during some short periods, and the intervals between these periods were very long. An example of such an inhomogeneous examination sequence is shown in Figure 4.

Fig. 7. Histogram of the SD value of PLT data acquisition intervals. Ave=17.85, SD=40.94, Median=6.11 (weeks). (Horizontal axis: Intervals of Data Acquisition (SD; Weeks); vertical axis: Number of Patients.)

Fig. 8. Histogram of the mean+SD value of PLT data acquisition intervals. Ave=29.41, SD=63.0, Median=12.88 (weeks). (Horizontal axis: Intervals of Data Acquisition (Average+SD; Weeks); vertical axis: Number of Patients.)

Taking these observations into account, we defined a new measure for evaluating the inhomogeneity of sampling intervals. The measure was defined as the sum of the mean and SD of the intervals, so that it evaluates both the variance and the size of the intervals. Figure 8 shows the histogram of this sum. The figure shows that 95% of the data fell within about 100 weeks (2 years), and 97% fell within about 150 weeks (3 years). Therefore, we chose three years as the threshold value for inhomogeneity.
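To illustrate why the sum of mean and SD responds to both the size and the variance of the intervals, the sketch below evaluates it on three synthetic schedules and then computes the fraction of sequences that a candidate threshold would retain. The helper names `mean_plus_sd` and `coverage`, the example schedules, and the use of the population SD are assumptions made for illustration only.

```python
import numpy as np


def mean_plus_sd(exam_weeks):
    """Mean + SD of the sampling intervals of one sequence, in weeks."""
    intervals = np.diff(np.sort(np.asarray(exam_weeks, dtype=float)))
    return intervals.mean() + intervals.std()  # population SD (assumption)


def coverage(measures, threshold_weeks):
    """Fraction of sequences whose measure does not exceed the threshold."""
    return float(np.mean(np.asarray(measures, dtype=float) <= threshold_weeks))


# Hypothetical schedules, in weeks since the first examination.
schedules = {
    "regular, dense": np.arange(0, 520, 4),            # every 4 weeks
    "regular, sparse": np.arange(0, 1040, 104),        # every 2 years
    "bursts with gap": np.array([0, 1, 2, 500, 501, 502]),
}

measures = [mean_plus_sd(s) for s in schedules.values()]
for name, m in zip(schedules, measures):
    print(f"{name}: mean+SD = {m:.1f} weeks")

# Share of these sequences kept by a 150-week (about three-year) threshold.
print(f"retained at 150 weeks: {coverage(measures, 150.0):.2f}")
```

The sparse but regular schedule has a large mean and zero SD, while the burst schedule has a moderate mean and a very large SD; the sum ranks the burst schedule as the most inhomogeneous of the three.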

Based on the above three criteria, we removed short and inhomogeneous sequences from the dataset and again analyzed the relationships between PLT count and the progress of liver fibrosis. As in the previous analysis, 222 of 720 cases were first removed because of the lack of biopsy information. To the remaining 498 cases we applied the above selection criteria. The results are tabulated in Table 3. The criteria were applied in the order III, I, II, because calculating the variance for sequences with only a few points is meaningless. Of the 498 sequences, 169 were excluded, and the remaining 329 were used for analysis. The composition of the 329 cases by virus type and administration of the IFN treatment was as follows: Type B: 109 cases, Type C without IFN: 60, Type C with IFN: 160.

Table 3. Results of inhomogeneous/short PLT sequence identification.

Type of sequence              Number of cases
Short (condition I)           149
Inhomogeneous (condition II)  13
Few points (condition III)    7
Other (Ok)                    329

Table 4. Weeks for reaching abnormally low PLT level, stratified by fibrotic stage

      Weeks (for all cases)    Weeks (for non-zero cases)
      n    Mean   SD           n    Mean   SD
F0    1680-0--
F1    27   262.5  197.0        25   308.1  176.5
F2    19   165.0  171.8        13   241.2  156.1
F3    26   174.6  200.4        18   252.2  195.8
F4    41   110.0  159.8        25   180.5  171.2

As preprocessing, we first applied linear interpolation at one-week intervals to each of the 329 sequences. We then convolved each sequence with a Gaussian kernel of 6-month width in order to suppress noisy, high-frequency changes. After this processing, we detected the marker point at which the PLT count falls into the abnormally low range and stays there for at least 6 months, and calculated the number of weeks from the first date of examination to the marker point. If there was any overlap between the period of abnormally low PLT count and the administration period of IFN treatment, we discarded that marker point and searched for the next one, in order to minimize the effect of the short-term decrease of PLT induced by IFN treatment.
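The preprocessing and marker detection described above might look roughly like the following sketch. Several details are assumptions rather than the authors' choices: the function names, the reading of the 6-month kernel width as roughly six standard deviations (sigma = 26/6 weeks), 26 weeks as 6 months, edge padding at the sequence ends, treating the whole low run as the period compared against IFN administration, and the lower normal limit of 150 used in the synthetic example.

```python
import numpy as np


def weekly_smoothed(t_weeks, values, sigma_weeks):
    """Linearly interpolate onto a one-week grid, then Gaussian-smooth."""
    t = np.asarray(t_weeks, dtype=float)
    v = np.asarray(values, dtype=float)
    grid = np.arange(np.floor(t[0]), np.floor(t[-1]) + 1.0)
    series = np.interp(grid, t, v)
    # Normalised Gaussian kernel, truncated at about three sigma.
    half = int(round(3 * sigma_weeks))
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma_weeks) ** 2)
    kernel /= kernel.sum()
    # Repeat the edge values so the smoothed series keeps its length.
    padded = np.pad(series, half, mode="edge")
    return grid, np.convolve(padded, kernel, mode="valid")


def find_marker(grid, smoothed, lower_limit, hold_weeks=26, ifn_periods=()):
    """Weeks from the first examination to the first accepted low run, or None."""
    below = smoothed < lower_limit
    i, n = 0, below.size
    while i < n:
        if not below[i]:
            i += 1
            continue
        j = i
        while j < n and below[j]:
            j += 1
        start, end = grid[i], grid[j - 1]          # low run covers [start, end]
        long_enough = (j - i) >= hold_weeks
        overlaps_ifn = any(start <= b and a <= end for a, b in ifn_periods)
        if long_enough and not overlaps_ifn:
            return float(start - grid[0])
        i = j                                      # discard and look for the next run
    return None


# Synthetic sequence: a dip during a hypothetical IFN period, then a lasting fall.
t = np.arange(0, 521, 4)
plt_values = np.full(t.shape, 200.0)
plt_values[(t >= 100) & (t <= 128)] = 100.0
plt_values[t >= 300] = 100.0

grid, smooth = weekly_smoothed(t, plt_values, sigma_weeks=26 / 6)
naive = find_marker(grid, smooth, lower_limit=150.0)
corrected = find_marker(grid, smooth, lower_limit=150.0, ifn_periods=[(100, 126)])
print(naive, corrected)
```

In this synthetic case the detection without IFN information stops at the treatment-induced dip, while passing the IFN period moves the marker to the later, sustained decrease.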

The results are tabulated in Table 4. For 215 sequences, no marker point was detected because the PLT count did not fall below the normal range. Therefore, Table 4 provides the summary for the remaining 114 cases stratified by fibrotic stage. There are some special cases in which the PLT count was already at an abnormally low level on the first date of examination. The left column of the table shows the summary of the data including these '0 week' cases, and the right column shows the summary excluding them. Both summaries demonstrate that F4 had the shortest average period, of 3 to 4 years. Interestingly, the order was F4

Table 5. Relationships between weeks for reaching abnormally low PLT level and fibrotic stage, stratified by virus type and administration of IFN treatment (all cases)

      Weeks (B)            Weeks (C, no IFN)    Weeks (C, IFN)
      n   Mean   SD        n   Mean   SD        n   Mean   SD
F1    4   153.5  196.9     11  369.7  213.0     12  200.5  141.1
F2    8   126.9  171.5     3   277.3  240.2     8   161.0  151.2
F3    9   140.8  177.3     4   344.3  312.8     13  145.8  164.0
F4    10  66.0   124.5     14  145.6  161.4     17  106.7  177.7

Table 5 tabulates the result of stratifying the cases in the left column of Table 4 by virus type and administration of IFN treatment. Although it requires further investigation and validation, the result demonstrates that, for all stages, B < C+IFN

3 Conclusions

This paper has presented a criterion for identifying inhomogeneous sequences in time-series medical datasets based on the variance of the sampling interval, the duration of data acquisition, and the number of data points. After removing inhomogeneous and short sequences from the dataset, we analyzed the hepatitis dataset again and obtained the following findings regarding the stage of liver fibrosis and the speed of platelet (PLT) decrease: (1) in F2 cases, PLT reaches an abnormally low level faster than in F3 cases; (2) in type C hepatitis patients who did not receive interferon (IFN) treatment, PLT reaches an abnormally low level more slowly than in those who received IFN treatment, for all fibrosis stages. Future work includes clinical validation of the findings and clustering experiments using the refined dataset.

Acknowledgments

This work was supported in part by the Grant-in-Aid for Scientific Research on Priority Area (B) (No. 759) “Implementation of Active Mining in the Era of Information Flood”, planned research “Development of the Active Mining System in Medicine Based on Rough Sets” (#13131208), by the Ministry of Education, Culture, Science and Technology of Japan.

References

1. N. Ueda and S. Suzuki (1990): A Matching Algorithm of Deformed Planar Curves Using Multiscale Convex/Concave Structures. IEICE Transactions on Information and Systems, J73-D-II(7): 992-1000.

2. F. Mokhtarian and A. K. Mackworth (1986): Scale-based Description and Recognition of Planar Curves and Two-Dimensional Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(1): 24-43.

3. S. Hirano and S. Tsumoto (2003): An Indiscernibility-based Clustering Method with Iterative Refinement of Equivalence Relations - Rough Clustering -. Journal of Advanced Computational Intelligence and Intelligent Informatics, 7(2): 169-177.

4. S. Hirano and S. Tsumoto (2003): Multiscale Analysis of Long Time-series Medical Databases. Proc. AMIA Annual Symposium 2003, Washington DC, 289-293.
