dc.contributor.author: Kakarla, Yamani
dc.date.accessioned: 2018-08-29T16:24:47Z
dc.date.available: 2018-08-29T16:24:47Z
dc.identifier.uri: http://hdl.handle.net/10222/74167
dc.description.abstract: In research that involves medical records, it is important that patient-identifiable details are removed before the records are made available for research, a requirement enforced by the HIPAA Privacy Rule and Public Law 104-191. De-identification is the redaction or masking of individually identifiable pieces of protected health information (PHI) in clinical notes to keep the patient's identity from being exposed. With the increasing adoption of electronic health records (EHRs) in the healthcare industry, a growing amount of medical information is available in digital format. De-identifying such large collections of records manually is a challenging task. Automated de-identification systems address this issue by automatically tagging the free-text medical records. The primary objective of this research is to explore automated techniques in natural language processing for de-identifying unstructured health records. To facilitate studies in automatic de-identification using statistical models, my work provides an overview of the evaluation results of a core NLP-based de-identification model. My thesis describes the complexities in learning the variants of the model in the parameter space, explains the performance metrics (precision, recall, and F1 measure) of the models, compares the results with a rule-based de-identification system, and finally provides directions for future research. The data used for evaluation consisted of three different types of medical notes: discharge summaries, longitudinal medical records, and nursing notes. Through model-specific feature engineering and the introduction of hidden neural gates (a model parameter) to the core model, the highest tag-level F1 measure achieved was 0.967, on discharge summaries. For this task, in cases where more importance should be given to precision, the F1 measure can over-weight recall.
The performance results from all models are encouraging and provide scope for future work. Overall, this thesis intends to increase practitioners' understanding of the nature of de-identification models and how they are trained, to help preserve medical information while not compromising the privacy of individuals. [en_US]
dc.language.iso: en [en_US]
dc.subject: Privacy [en_US]
dc.subject: De-identification [en_US]
dc.subject: Protected Health Information [en_US]
dc.subject: HIPAA [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: Sequence Labelling [en_US]
dc.subject: Conditional Random Fields [en_US]
dc.subject: Semi-CRF [en_US]
dc.subject: Neural-CRF [en_US]
dc.title: Evaluation of Machine Learning Models for Patient Data De-identification in Clinical Records [en_US]
dc.type: Thesis [en_US]
dc.date.defence: 2018-08-15
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.degree: Master of Computer Science [en_US]
dc.contributor.external-examiner: N/A [en_US]
dc.contributor.graduate-coordinator: Michael McAllister [en_US]
dc.contributor.thesis-reader: Dr. Vlado Keselj [en_US]
dc.contributor.thesis-reader: Dr. Srinivas Sampalli [en_US]
dc.contributor.thesis-supervisor: Dr. Stan Matwin [en_US]
dc.contributor.thesis-supervisor: Dr. Aaron Gerow [en_US]
dc.contributor.ethics-approval: Not Applicable [en_US]
dc.contributor.manuscripts: No [en_US]
dc.contributor.copyright-release: No [en_US]